-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimized SHA2 implementation. #78
base: master
Are you sure you want to change the base?
Conversation
m128i* {.importc: "__m128i", x86type.} = object | ||
data: array[2, uint64] | ||
|
||
let |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let
introduces a global variable which, when you access it, is a side effect - this is (one of) the reasons func
can't be used. It should be safe to switch this to const
as long as only unaligned loads are used.
sha2 needs to be func
for lots of reasons, ie one should be able to assume that it has no side effects (in fact, our code doesn't compile with this branch)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const
is not working here, because when Nim encounters <const>[x]
in the code it substitutes it with value, but for AVX/AVX2 implementations i need constants to be stored in memory in specific order one by one. So AVX/AVX2 operations could load not single 64bit value but 4 or 8 values with one instruction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in that case, access to it needs to go through the same {.noSideEffect.}
trick so that the public interface remains side-effect-free.
This is really a bug in nim though, that there is no native mechanism to make this kind of code side-effect-free - because at the end of the day, there's actually no side effect in the generated code, not even with let
- it's just a semantic gap in nim.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not sure about this {.noSideEffect.}
thing.
init()
updates context
- side effect present
update()
updates context
- side effect present
finish()
updates context
- side effect present
reset()
updates context
- side effect present
What the reason to avoid side effect for this specific API functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
side effect in this context means that it depends or updates something outside of the given parameters - for example a global - the "test" to determine whether something has a side effect or not is to reason about whether two calls with the same inputs can yield different outputs - in this case, they can't - the reason the compiler doesn't understand this is due to a limitation in the nim compiler - it can't guarantee that a global variable (K0 in this case) keeps the same value between one call to update and the next call.
Hash functions by definition are side-effect-free, or they wouldn't be hash functions - ie the output should depend only on the inputs and not anything else - "side-effect-free" for example allows the compiler to remove repeated computation of the same thing - for i in 0..9: echo digest("hello")
can be rewritten by the compiler as let tmp = digest("hello"); for i in 0..9: echo tmp
if it can prove (or is told) that digest
is side-effect-free.
In libp2p for example, there's a type DigestFcn = proc(...): ... {.noSideEffect.}
- it requires that all digest functions are side-effect-free so that when someone uses a DigestFcn
, the compiler can optimise it like above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nim added (partially) support addressable constants iirc.
I made the request here nim-lang/RFCs#257 though I'm not sure what part is implemented and what not. cc @Araq
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can confirm that we can take the address of a const since at least 1.6, see example with ssse3 impl of SHA256:
const K256* = [
0x428a2f98'u32, 0x71374491'u32, 0xb5c0fbcf'u32, 0xe9b5dba5'u32, 0x3956c25b'u32, 0x59f111f1'u32, 0x923f82a4'u32, 0xab1c5ed5'u32,
0xd807aa98'u32, 0x12835b01'u32, 0x243185be'u32, 0x550c7dc3'u32, 0x72be5d74'u32, 0x80deb1fe'u32, 0x9bdc06a7'u32, 0xc19bf174'u32,
0xe49b69c1'u32, 0xefbe4786'u32, 0x0fc19dc6'u32, 0x240ca1cc'u32, 0x2de92c6f'u32, 0x4a7484aa'u32, 0x5cb0a9dc'u32, 0x76f988da'u32,
0x983e5152'u32, 0xa831c66d'u32, 0xb00327c8'u32, 0xbf597fc7'u32, 0xc6e00bf3'u32, 0xd5a79147'u32, 0x06ca6351'u32, 0x14292967'u32,
0x27b70a85'u32, 0x2e1b2138'u32, 0x4d2c6dfc'u32, 0x53380d13'u32, 0x650a7354'u32, 0x766a0abb'u32, 0x81c2c92e'u32, 0x92722c85'u32,
0xa2bfe8a1'u32, 0xa81a664b'u32, 0xc24b8b70'u32, 0xc76c51a3'u32, 0xd192e819'u32, 0xd6990624'u32, 0xf40e3585'u32, 0x106aa070'u32,
0x19a4c116'u32, 0x1e376c08'u32, 0x2748774c'u32, 0x34b0bcb5'u32, 0x391c0cb3'u32, 0x4ed8aa4a'u32, 0x5b9cca4f'u32, 0x682e6ff3'u32,
0x748f82ee'u32, 0x78a5636f'u32, 0x84c87814'u32, 0x8cc70208'u32, 0x90befffa'u32, 0xa4506ceb'u32, 0xbef9a3f7'u32, 0xc67178f2'u32
]
const VecWords = 16 div sizeof(Word) # sizeof(m128i) / sizeof(Word)
func initMessageSchedule(
msnext: var array[VecNum, m128i],
ms: var Sha256_MessageSchedule,
message: ptr UncheckedArray[byte]) {.inline.} =
## Initial state, from data
## - Precompute steps for the future message schedule `msnext`
## - compute the current message schedule `ms`
let mask = setr_u32x4(0x00010203, 0x04050607, 0x08090a0b, 0x0c0d0e0f)
let pK256 = K256.unsafeAddr()
staticFor i, 0, VecNum:
msnext[i] = loadu_u128(message[i * sizeof(m128i)].addr)
msnext[i] = shuf_u8x16(msnext[i], mask)
storea_u128(ms.w[VecWords*i].addr, add_u32x4(msnext[i], loadu_u128(pK256[VecWords*i].addr)))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it still doesn't support aligning though
|
||
proc init*(ctx: var Sha2Context, | ||
implementation = Sha2Implementation.Auto, | ||
cpufeatures: set[CpuFeature] = {}) {.noinit.} = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with empty features by default, it becomes impractical to use the cpuid-based selection - users will end up with the slow ref implementation in most cases, including when using the convenience helpers found elsewhere in the library.
Also, this is not really a setting that is useful to have at the context level: we create a new context every time we hash something - doing the feature selection here is wasteful (hashing a beacon state creates millions of contexts).
The classic way to do this is something like:
let sha256update: proc(p: pointer, size: int) {.noSideEffect.} = selectBestImpl()
# alternatively:
var p = proc(p: pointer, size: int) =
p = selectBestImplementation()
p(pointer, size)
template callUpdate(p, s) =
# There's no side effect here logically, just that the compiler doesn't understand it
{.noSideEffect.}:
sha256Update(p, s)
there are fancier options but they tend to be overkill.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its supposed that you will obtain CPU features list when application starts, and use it everywhere when you instantiate contexts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use it everywhere when you instantiate contexts.
This is very impractical - ie generally, there exists no reason why you would want the slower implementation outside of benchmarking and testing, so it should be the default. This is how I would expect things to work, ie openssl, blst, hashtree etc all use autodection of this kind by default.
The most simple example is the nimcrypto itself that fails to do this, when calling sha256.digest(myvalue)
- it would be silly not to use the sha2ext version in this case since most computers have that by now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its not impractical when you start tuning your application to use different types of algorithms for different purposes, for example:
You have CPU which does support only AVX, AVX2 and does not support SHAEXT. If your application is mostly hashes 64 byte blocks it is better to use AVX implementation, but if you supposed to use it to hash big blobs of data - you should start using AVX2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
start tuning
I'm not talking about removing the ability to choose an algo - I'm highlighting that defaulting to "use the slowest algo" is impractical (the vast majority of users will not care about avx or avx2 - they just want a decently hash fast and certainly do not want the slowest one available)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you look at my sample code from above, what I'm proposing is that you have a function pointer that you replace with a cpu-specific implementation on first call - the first time it's called, it calls cpuid and picks the best general-purpose implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we discussing 2 points in this comment.
- How to optimize
compress
function invocation, using procedure variable in context. - What type of implementation should be used when no arguments passed in context initialization.
All my comments is exactly about point 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comments are about both:
- a proc var removes case and assert overhead
- choosing a default value for it based on cpuid on first invokation is simple, easy, safe and desireable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you trying to propose me global variable or thread-local variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes - for the default
A few benchmarks for hashing the beacon state - this is certainly not a exhaustive benchmark because it only tests 64-byte values, but it's still indicative on that particular sample size: 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz
|
AVX is faster than AVX2 because of data size... AVX2 implementation uses AVX implementation for 64 bytes data. |
ctx.module = ctx.getImplementation(Sha2Implementation.Auto, cpufeatures) | ||
ctx.reset() | ||
|
||
proc clear*(ctx: var Sha2Context) {.noinit.} = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
noinit?
template compress(ctx: var Sha2Context, data: openArray[byte], blocks: int) = | ||
when (ctx is sha224) or (ctx is sha256): | ||
mixin sha256Compress | ||
case ctx.module |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a function pointer here should also result in better code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ctx.module
could be a proc(...) {.nimcall.}
that is set on init - then you don't need a case
traversal for every compress call
when compiles(SHA2_AVX_sha256Compress): | ||
sha2_avx.sha256Compress(ctx.state, data, blocks) | ||
else: | ||
raiseAssert "AVX implementation is not available for " & |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this cannot happen really since init is supposed to pick a working version - asserts like this add a lot of bloat to the generated c code which tend to slow things down by a tiny bit. that said, a pointer-based impl wouldn't have this case statement at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its possible that some compilation settings will not bundle some of the implementations to application binary, but ctx.module
is variable which is set at runtime.
glength: int | ||
bytesWrite: int | ||
|
||
when not((M is byte) or (M is char)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not pbkdf2*[M, N: char|byte]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like workaround for some old Nim compiler version nimcrypto
supports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in other places, the N: bchar
style is used
nimcrypto/sha2/sha2_neon.nim
Outdated
when defined(arm64): | ||
import "."/sha2_common | ||
|
||
{.passC: "-march=armv8-a+crypto".} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is localpassc enough here? this will override -march
for all c files which might not be desired
nimcrypto/sha2/sha2_ref.nim
Outdated
W[14] = beLoad32(data, offset + 56); W[15] = beLoad32(data, offset + 60) | ||
|
||
for i in 16 ..< 64: | ||
W[i] = SIG1(W[i - 2]) + W[i - 7] + SIG0(W[i - 15]) + W[i - 16] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
profiler says a lot of time is spent here - maybe a good target for unrolling? W should probably be aligned to 32/64 as well and maybe noinit?
nimcrypto/sha2/sha2_sha.nim
Outdated
when defined(amd64): | ||
import "."/sha2_common | ||
|
||
{.passC:"-msha".} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, localpassc - passc will enable these extensions for all c files which is a problem because gcc will then make use of these features in "general" code which is not hidden behind cpuid-based detection and the whole program will sigill on older machines.
Note that you can bench also vs Constantine which includes OpenSSL git clone https://github.com/mratsim/constantine
cd constantine
CC=clang nimble bench_sha256 |
from what I remember, |
Remove debugging echoes.
Should address #36