Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Fast-Mod algorithm for native EEHashTableBase #65778

Closed
EgorBo opened this issue Feb 23, 2022 · 7 comments · Fixed by #65926
Closed

Use Fast-Mod algorithm for native EEHashTableBase #65778

EgorBo opened this issue Feb 23, 2022 · 7 comments · Fixed by #65926
Labels
area-VM-coreclr tenet-performance Performance related issue
Milestone

Comments

@EgorBo
Copy link
Member

EgorBo commented Feb 23, 2022

When I profile this simple virtual generic callsite:

public class Prog
{
    static void InvokeVGM(Prog p)
    {
        while (true)
            p.VGM<string>();
    }

    virtual void VGM<T>() {}
}

in VTune I see div as the most expensive part (if I read it correctly):
image
It turns out to be a EEHashTable lookup:
image

And It does make sense since div is a very expensive operation, on some CPU its latency can be bigger than 100 cycles. On my Coffee Lake it's 26/6 while e.g. mul is 3/1 (Latency/rec.trput)

The managed Dictionary was optimized with "Fast Mod" algorithm where we pre-calculate a special multiplier every time a new bucket is added and then can avoid expensive idiv operation by using mul see dotnet/coreclr#27299 and #406

Quick demo:

public static uint Test(uint x)
{
    uint result = 0;

    ulong cachedMul = GetFastModMultiplier(y);
    for (int i = 0; i < x; i++)
    {
        // result = i % x;
        result = FastMod(i, x, cachedMul);
    }
    return result;
}

public static ulong GetFastModMultiplier(uint divisor) =>
    ulong.MaxValue / divisor + 1;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static uint FastMod(uint value, uint divisor, ulong multiplier) => 
    (uint)(((((multiplier * value) >> 32) + 1) * divisor) >> 32);

NOTE: the algorithm adds an overhead on dictionary expansion
NOTE2: CompareKeys is not inlined

@EgorBo EgorBo added the tenet-performance Performance related issue label Feb 23, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Feb 23, 2022
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@EgorBo
Copy link
Member Author

EgorBo commented Feb 23, 2022

cc @VSadov

@jkotas
Copy link
Member

jkotas commented Feb 23, 2022

@bartonjs
Copy link
Member

If the number of buckets is a power of 2 then mod is just a masking operation. (Though depending on the distribution of dwHash values it may create overweighted buckets)

@EgorBo
Copy link
Member Author

EgorBo commented Feb 23, 2022

I applied the FastMod locally and got 22% improvement for the benchmark - but I am not sure I do it in a race-condition free fashion.

@VSadov
Copy link
Member

VSadov commented Feb 23, 2022

Using FastMod could be an easy way to speed up EEHashTableBase. A few thoughts:

  • FastMod has two components - divisor and multiplier. They must match thus needing a bit more care regarding races. Storing multiplier in the table, the same way we store length, could be sufficient.

  • Are we sure the real problem is not collisions?
    Sometimes excessive hashing expenses is what we see, but the real problem is collisions.

  • Whether EEHashTableBase is a good choice for caching is another good question.
    If scenario that we have here is similar to the type casting cache (i.e. the values can be relatively cheaply recomputed), then a solution similar to cast cache could fit better.

@EgorBo
Copy link
Member Author

EgorBo commented Feb 23, 2022

Prototype: EgorBo@c99fe02

the only problem that it assumes NumBuckets fits into 32bit (pretty normal) and dwHash - they always do, DWORD is 32bit

public class P
{
    [Benchmark]
    public void CallVGM()
    {
        VGM<string>();
    }

    virtual void VGM<T>()
    {
    }

    static void Main(string[] args) => 
        BenchmarkSwitcher.FromAssembly(typeof(P).Assembly).Run(args);
}
Method Toolchain Mean Error StdDev Ratio
CallVGM \Core_Root\corerun.exe 5.267 ns 0.1216 ns 0.1301 ns 1.00
CallVGM \Core_Root_PR\corerun.exe 4.452 ns 0.0342 ns 0.0319 ns 0.85

then a solution similar to cast cache could fit better.

I agree but that sounds like a more invasive change, while FastMod could also improve other EEHashTable users?

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Feb 27, 2022
@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Feb 28, 2022
@mangod9 mangod9 added this to the 7.0.0 milestone Feb 28, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Mar 1, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Apr 1, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-VM-coreclr tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants