Reintroduce fast math functions #7495

Merged: 7 commits merged into LMMS:master on Oct 1, 2024
Conversation

messmerd (Member) commented Sep 12, 2024

Added the fastPow function back into lmms_math.h after it was mistakenly removed in #7382, and removed the undefined behavior caused by type punning with a union.

Replaced calls to std::fma with x * y + z (see discussion in review comments below).

I left the fast sqrt function out because it was unused.
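For context, the old fastPow reinterpreted a double's bit pattern through a union, which is undefined behavior in C++; the same trick can be expressed through std::bit_cast (or std::memcpy) instead. The sketch below is a rough illustration of that rewrite, not necessarily the exact code merged here; the name fastPowSketch and the use of C++20's std::bit_cast are assumptions.

```cpp
#include <bit>      // std::bit_cast (C++20); std::memcpy works pre-C++20
#include <cstdint>

// Ankerl-style pow approximation: nudge the exponent field of the IEEE-754
// representation instead of computing a real power. Illustrative sketch only.
inline double fastPowSketch(double a, double b)
{
	// Read the double's bit pattern without type punning.
	const auto bits = std::bit_cast<std::uint64_t>(a);
	// The high 32 bits hold the sign, exponent, and top of the mantissa;
	// 1072632447 is the usual bias constant for this approximation.
	const auto high = static_cast<std::int32_t>(bits >> 32);
	const auto adjusted = static_cast<std::int32_t>(b * (high - 1072632447) + 1072632447);
	// Rebuild a double from the adjusted high word and a zeroed low word.
	return std::bit_cast<double>(static_cast<std::uint64_t>(static_cast<std::uint32_t>(adjusted)) << 32);
}
```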

Review thread on include/lmms_math.h (outdated, resolved)
DomClark self-requested a review on September 12, 2024 at 17:47
Rossmaxx (Contributor) left a comment:

All good.

Two more review threads on include/lmms_math.h (outdated, resolved)
Rossmaxx (Contributor) commented:

@messmerd Having thought about it a bit, how about we just do the raw a * b + c instead of calling fma? @LostRobotMusic is there any chance this might be faster, since, as Dom said, FP_FAST_FMA* is rarely defined?

LostRobotMusic (Contributor) commented Sep 25, 2024

> @LostRobotMusic is there any chance this might be faster, since, as Dom said, FP_FAST_FMA* is rarely defined?

Zero chance, because the code that's already there falls back to a * b + c when the fast FMA flag isn't set.
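For reference, that guard typically has the following shape (a generic sketch, not necessarily the exact helper that was in lmms_math.h):

```cpp
#include <cmath>

// FP_FAST_FMA is defined by <cmath> when std::fma is no slower than a plain
// multiply-add, which usually means a hardware FMA instruction is available.
inline double fusedMulAddSketch(double a, double b, double c)
{
#ifdef FP_FAST_FMA
	return std::fma(a, b, c); // one fused instruction, a single rounding step
#else
	return a * b + c;         // separate multiply and add, two rounding steps
#endif
}
```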

> Having thought about it a bit, how about we just do the raw a * b + c instead of calling fma?

I have another idea... using the FMA instruction without explicitly calling std::fma (explained below). But I have no strong opinion either way; a * b + c is already blazingly fast.

@DomClark @messmerd A quick check of Dom's findings shows he's correct: when using that -march flag, both the FMA and non-FMA versions of the code compile to use vfmadd132sd for me in GCC. I think this is because it enables the -mfma flag, which produces the same result. This also allows FP_FAST_FMA to be defined in GCC.

However, as messmerd mentioned on Discord, FMA has different precision behavior than the standard multiply-add, and I found it odd that the compiler would choose to optimize something in a way that objectively changes its behavior. As it turns out, GCC has -ffp-contract=fast by default, even with -fno-fast-math. Apparently these behavior differences are so incredibly minor that GCC decided to have the faster behavior enabled by default regardless of other settings.
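A minimal way to see the contraction (assuming GCC on x86-64; the file and function names are just for illustration):

```cpp
// madd.cpp: try  g++ -O2 -mfma -S madd.cpp
// With GCC's default -ffp-contract=fast, this compiles to a single
// vfmadd*sd instruction; compiling with -ffp-contract=off instead keeps the
// separate vmulsd + vaddsd pair, with its extra rounding step.
double madd(double a, double b, double c)
{
	return a * b + c;
}
```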

(I don't recommend this ↓)
If we wanted to have full control over this precision behavior, while also making use of FMA where applicable, the solution would be to enable FMA with -mfma, prevent the compiler from optimizing things to use FMA automatically via -ffp-contract=off (I don't know what other things this would impact aside from FMA), and then use FMA directly whenever we want to make use of that...

(I recommend this, if possible ↓)
However, since this whole time GCC would have been doing this anyway with certain march flags (as Dom demonstrated), I think it can safely be concluded that allowing the compiler to optimize things to use FMA globally via -mfma wouldn't noticeably change any behavior whatsoever inside of LMMS. Assuming I'm not missing some important detail, I highly recommend taking this approach...

IF all of the hardware we target supports FMA. I don't know whether this is the case. If this isn't the case, then we can still allow it to potentially be enabled for compiled builds, but it would have to be disabled for releases on the website or else the program simply wouldn't run on that hardware.


In any case, unless we take the aforementioned manual FMA approach with -ffp-contract=off (which I don't think we should do), there isn't any reason for fastFma to exist and it should be removed, if I'm not mistaken.

I don't know a lot about this topic, so hopefully I'm not being completely glaringly wrong on every front here in some way I'm not noticing. That would be embarrassing.

Rossmaxx (Contributor) commented:

Thanks for the clarification, Lost. Now another question: would it make sense to change other calls to std::pow to use fastPow instead?

LostRobotMusic (Contributor) commented:

> Thanks for the clarification, Lost. Now another question: would it make sense to change other calls to std::pow to use fastPow instead?

Probably not; only if you really know what you're doing. fastPow is an approximation, and I don't think anybody's measured its error yet. It runs 5x as fast on my computer and 14x as fast on my laptop, but its speed doesn't matter if it's too inaccurate for the task it's being used for. Backwards compatibility needs to be kept in mind as well. The first absolutely necessary step would be measuring its accuracy.

I recommend not bothering to replace existing pow uses except in cases where they cause a noticeable performance detriment. Do not use it in cases where the pow base is known at compile time (e.g. pow(2, x) or pow(10, x)); those can be optimized in significantly better and more accurate ways than a general-purpose pow approximation, as sketched below. I'll eventually make a PR to optimize LMMS's dBFS/amplitude conversion functions specifically.
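As an example of the compile-time-base case: since 10^(d/20) = 2^(d * log2(10) / 20), a dBFS-to-amplitude conversion can lean on std::exp2 instead of a general pow. This is only a hypothetical sketch; dbfsToAmpSketch is not LMMS's actual function, and the real conversion helpers may differ.

```cpp
#include <cmath>

// Hypothetical sketch of a dBFS -> linear amplitude conversion that avoids a
// general-purpose pow(10, x) via the identity 10^(d/20) = 2^(d * log2(10) / 20).
inline float dbfsToAmpSketch(float dbfs)
{
	constexpr float kScale = 0.1660964047f; // log2(10) / 20, precomputed
	return std::exp2(dbfs * kScale);
}
```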

Commit added: "On GCC with -O1 or -O2 optimizations, this new implementation generates identical assembly to the old union-based implementation."
Rossmaxx (Contributor) commented:

@LostRobotMusic you did say you use std::pow in LOMM, right? Wanna replace it?

LostRobotMusic (Contributor) commented:

> @LostRobotMusic you did say you use std::pow in LOMM, right? Wanna replace it?

No, I said I use the dBFS/amplitude conversion functions. In my last message I said I'll make a PR to optimize those.

Rossmaxx (Contributor) commented Sep 29, 2024

Regarding LOMM, I remember you said it somewhere on Discord. Per: https://discord.com/channels/203559236729438208/784594576223633478/1286903926653714433

I dug out the exact chat this time to avoid flak.

LostRobotMusic (Contributor) commented:

> Regarding LOMM, I remember you said it somewhere on Discord. Per: https://discord.com/channels/203559236729438208/784594576223633478/1286903926653714433
>
> I dug out the exact chat this time to avoid flak.

LOMM uses LMMS's dbfsToAmp function, and LMMS's dbfsToAmp function uses std::pow. std::pow is not directly used anywhere in LOMM's code, and as I said, I'll eventually make a PR to optimize LMMS's dBFS/amplitude conversion functions specifically.

Rossmaxx (Contributor) commented:

OK, I misunderstood that point.

messmerd merged commit 121d608 into LMMS:master on Oct 1, 2024 (11 checks passed)
messmerd deleted the fast-math branch on October 1, 2024 at 18:35