Add single instruction multiple data functionality for GDScript / C# / C++ #290
Comments
It's also worth looking into making this work with Bullet; currently there are some TODOs about SIMD being disabled because it's actually slower for some reason. |
Sure, this is easy 😄. I just started with vec4f_32 as it is the most common in Godot so far. Double is no problem, as are the integers. Integers are even more fun; you can get really awesome speedups with vectorization. For instance, with AVX-512 (if it has the byte version, I'm not sure offhand) you can do 64 calculations in a single instruction. I asked Calinou about the minimum SSE / NEON we support. It seems SSE2 is mandated for x86_64, so the compilers will autovectorize provided you tell them to (e.g. -O3). For x86_32 it may be worth specifying the SSE2 flag (I think this may be done in the odd place already). For anything more than this, though, we'd need to detect support at runtime. This is fairly easy to put in, choosing the codepath according to the CPU support rather than relying on autovectorization. However, that's getting ahead of ourselves; last time we mentioned it, reduz might have needed convincing on the use of intrinsics. Having said that, if the basic support for ranged functions is put in, it is fairly trivial to add intrinsics. |
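To make the runtime-detection idea above concrete, here is a minimal sketch (not from the module) using the GCC/Clang `__builtin_cpu_supports` builtin; MSVC would need `__cpuid` from `<intrin.h>` instead:

```
// Minimal sketch of runtime CPU feature detection (GCC/Clang builtin).
// The binary can be built against an SSE2 baseline yet still select a
// faster codepath on CPUs that support wider instruction sets.
#include <cstdio>

int main() {
    if (__builtin_cpu_supports("avx512f")) {
        printf("AVX-512F available\n");
    } else if (__builtin_cpu_supports("avx2")) {
        printf("AVX2 available\n");
    } else {
        printf("falling back to SSE2 (mandated on x86_64)\n");
    }
    return 0;
}
```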
You reserved space in the fast array but left the normal array empty, which immediately skews the result. I have already seen significant speedups from a .resize() for any moderately sized array initialization; push_back would just resize by 1 every time, basically reallocating the entire thing. I'm not saying it'll be as fast as native code, but it will most likely be faster than this example. |
Sorry, I could have made this clearer. The reservation method isn't under test; it isn't timed (nor is it a factor in this proposal). I could equally have used a resize-and-set rather than push_back, as you say. The performance timing is wrapped around the mathematical functions under test, in this case addition, via the start and end timing calls. In fairness, the basic GDScript array in the example I gave is not contiguous in memory, so I made the timings in the 'more timings' section against a PoolVector, which is contiguous, giving a more level playing field. I would welcome others trying some timings themselves; you just need to download the module and compile the engine. That reminds me, I may need to tweak the compiler switches in the module SCsub to work under Visual Studio (I only have Linux here). |
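For reference, a minimal sketch of the timing pattern described above, using std::chrono purely for illustration (the module itself would use Godot's OS timing calls): only the operation under test sits between the two clock reads, so array setup is never timed.

```
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    // Setup (allocation, resize) happens before the clock starts.
    std::vector<float> a(1000000, 1.0f), b(1000000, 2.0f), dst(1000000);

    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < a.size(); ++i) {
        dst[i] = a[i] + b[i]; // the operation under test
    }
    auto end = std::chrono::steady_clock::now();

    long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    printf("timing %lld ms\n", ms);
    return 0;
}
```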
This sounds like an oxymoron to me. If you want to have high-performance code you should not be using GDScript in the first place. That's what GDNative/NativeScript is for (using fast C code as a script). If the engine internal API can be optimized using vectorization and similar tricks, which in turn could be used by GDScript, then it should be fine. But having high-performance constructs for GDScript itself is not worth it IMO. |
Yes, sorry for not being clearer: these types of instructions could be available throughout, for use by C++ modules, GDNative, the main engine, C#, GDScript and any other language bindings via the API. I believe they already are in my demo module; afaik you don't have to do anything special apart from deriving the classes from something like Reference / Node etc. and doing the proper binding. I'm not advocating adding anything specific to GDScript; no changes are needed. I guess my emphasis on GDScript is that, although performance isn't the primary consideration in scripting languages, the use of functions that act on multiple sets of data at once allows them to be as performant (or in fact more so) than non-SIMD regular C++ code. Essentially you are moving the bottlenecks into the ranged C++ functions. Mike Acton has famously done a lot on this subject. Of course, the advantage of writing high-performance stuff in GDScript / C# is that it is cross-platform and doesn't require compiling the engine / GDNative, and is thus easier to deploy and use in games / apps. |
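A minimal sketch of the "moving the bottleneck" idea (hypothetical names, not the module's actual API): the per-element loop moves out of script into a single C++ call, which the compiler is then free to autovectorize.

```
// Instead of the script calling add() once per element (one expensive
// script-to-native transition each time), it makes one ranged call and
// the whole loop runs in C++.
void add_ranged(float *dst, const float *a, const float *b,
                int range_from, int range_to) {
    for (int i = range_from; i < range_to; ++i) {
        dst[i] = a[i] + b[i];
    }
}
```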
If C# is included, this probably won't be limited to C# but will expand to all .NET languages. In that case I would prefer to speak of .NET support rather than C# support for such features. |
It would be nice to have access to intrinsics for my audio project, since I'm working mainly in C# and the version supported by the Mono runtime (7.2 or netstandard2.1) doesn't seem to have them. I'm surprised how long this proposal has remained open! |
It looks like there is demand for this functionality, and I believe there has been a longstanding intention to add some degree of SIMD support (beyond compiler autovectorization) in Godot 4. (Often, if we get the go-ahead for something in Godot 4, we manage to sneak it into Godot 3.x 😉) |
A few more ideas I came up with on ways to implement this such that it is generally available, usable from the engine core as well as bound:

User Friendliness

There may be a balance to be struck between making something accessible to SIMD beginners, and serving more advanced users and engine contributors. On the whole I suspect this whole area is going to be the realm of power users.

Mixed Data

One aspect that I left for later in my original mockup was how to handle mixed data. In terms of user friendliness, having a bunch of preset fixed arrays serves as a gentle introduction to SIMD. However, this does limit the usefulness for more advanced stuff, particularly in the engine core, where we might have e.g. mixed vertex formats. Essentially, the machine doesn't care what a set of data is; the format is defined by what you write into the data, do with it, and read out. To that end I'm wondering whether, as a more flexible alternative to having multiple fixed types, we could have a single generic type.

External Data

It has occurred to me that, as well as creating data within the object itself, we could also allow it to operate on external data.

Internal Data

For internal data, instead of creating an array of e.g. 4f32, we could have functions to create an array of a given number of units.

Operating on Data

Once the data is in the array, the ranged functions would operate on it.

Extracting Results

This would ideally be coupled with nice ways of getting the final data result out of the internal storage.

Introducing SIMD into the engine core

A lot of core contributors have expressed an interest in having SIMD available in some form in the core engine. There will already be autovectorization in some cases, and it would also be possible to add single-use instructions for particular operations. This can be useful, but it is unlikely to lead to as many performance benefits as directly using ranged instructions imo (for a number of reasons, but essentially, outside of tight ranged loops, other effects such as cache misses and housekeeping start to dominate). We could possibly end up using both approaches, but hopefully there are a number of areas where converting bottleneck areas to use such ranged functions will lead to significant benefit. |
Linking this here (apologies, the topic seems a bit duplicated). Anyway, I really like that this is being thought about. I do think AVX detection should be a runtime check too, which helps this proposal work. In my comment I propose decoupling the data from the underlying math type, and using a runtime static variable to detect which CPU features are enabled and consuming them that way. During construction it would create the implementation, with the CPU features supported and the types for those feature sets, via a template for swapping out the underlying implementation.

We'd need to consider an approach for this in the implementation, and possibly try making a PR with some demo code for one of the underlying types, just to take a small step in the direction of SIMD instead of doing everything at once. We could support using SIMD from GDNative, but I think using it from GDScript could be possible too, provided we can ensure the user checks which CPU features are available with a match, and also does this runtime swap of the feature set in their code.

GDScript: if we decoupled the data from the math types as in my suggestion, we could even make use_double runtime-capable if we made a parent type for real in GDScript (at the expense of declaring a static pointer to the implementation and structure, and declaring a "Real" type instead of a raw float). The hard part is that static typing from GDScript would require users to put a Variant on their data, which potentially stores mm256 vs mm256d etc.

This mostly means that #4563 and #290 could be part of the same feature set, and would only require us to decouple the core types correctly, expose the vector-specific instructions to GDScript all of the time, and implement an OS.get_supported_instructions() function returning a dictionary. We could possibly have defines in GDScript, but checking a dict value is dead easy.

Example of using the avx variable in GDScript:
```
class_name ExamplePseudo

var data = [0, 0, 0]

func _ready():
    if OS.get_supported_instructions()["avx"]:
        data = mm256(0, 0, 0, 0)

func do_some_operation(input1, input2):
    match OS.get_current_instruction_enum():
        AVX:
            var inputs = concat(input1, input2)
            return _mm256_add_ps(inputs)
        NO_SUPPORT:
            return input1 + input2
```
Of course, my template is not 100% accurate for the instructions being used, so I will look at this and make it more accurate.
|
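A rough sketch (names hypothetical, not an agreed design) of the runtime swap described in the comment above: detect CPU features once, store a function pointer, and route every ranged call through it, so scripts never need to know which instruction set is in use.

```
#include <cstddef>

static void add_scalar(float *dst, const float *a, const float *b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = a[i] + b[i];
}

static void add_avx2(float *dst, const float *a, const float *b, std::size_t n) {
    // Placeholder: a real implementation would use _mm256_* intrinsics,
    // compiled in a separate translation unit built with -mavx2.
    for (std::size_t i = 0; i < n; ++i) dst[i] = a[i] + b[i];
}

using AddFunc = void (*)(float *, const float *, const float *, std::size_t);

static AddFunc choose_add() {
    // GCC/Clang builtin; MSVC would use __cpuid instead.
    if (__builtin_cpu_supports("avx2")) return add_avx2;
    return add_scalar;
}

// Chosen once at startup; all callers go through this pointer.
static AddFunc g_add = choose_add();
```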
@fire suggested looking at portable-simd from WASM, as this might be a good approach for handling it too. |
Note: changing the title back to the original. A timely example: the current discussion of a ray casting API that casts a single ray. It is more efficient to write an API that can cast multiple rays in the same call and return multiple results; this amortizes the cost of the function call etc.
The example library does this, see:
> My refined suggestion was to have it operate freeform on a data blob:

For SIMD to work most efficiently you need to structure the data in a SIMD-friendly manner, yes, so that you can make best use of the cache / linear reads / avoid shuffles etc.

> Again, unnecessary if all this can be handled transparently to the user.

I'm not sure what this means. SIMD instructions can work on raw data, with either an aligned or unaligned load; there's no requirement for it to be stored in a particular struct. In your example @RevoluPowered, why should the GDScript user care whether this uses AVX or SSE or a reference implementation under the hood? Is there any need to expose this complexity to the user? Surely a wrapper can do all of this, as this proposal suggests?
|
I think xsimd is a good fit for incorporation into Godot. It's a template library that abstracts SIMD instruction sets and allows for runtime detection of the best available SIMD instruction set. |
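For illustration, a small sketch based on xsimd's documented batch API (assuming xsimd is available as a dependency): `xsimd::batch<float>` maps to the widest instruction set enabled at compile time, so the same source serves SSE, AVX and NEON.

```
#include <cstddef>
#include <xsimd/xsimd.hpp>

void add_ranged(float *dst, const float *a, const float *b, std::size_t n) {
    using batch = xsimd::batch<float>;
    constexpr std::size_t width = batch::size; // e.g. 4 for SSE, 8 for AVX

    std::size_t i = 0;
    for (; i + width <= n; i += width) {
        batch va = batch::load_unaligned(a + i);
        batch vb = batch::load_unaligned(b + i);
        (va + vb).store_unaligned(dst + i);
    }
    for (; i < n; ++i) dst[i] = a[i] + b[i]; // scalar tail
}
```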
Sorry if this is too off-topic, but I feel like a reasonable "workaround" for those that need this now might be using the SIMD-accelerated .NET types. I have a feeling the overhead of converting between Godot and .NET types might make it not worth it, though. |
Hi, there's some impressive stuff in your proof of concept! I've just made one of my own: a project that basically attempts to be NumPy for Godot. You can find it here: https://github.com/Ivorforce/NumDot/tree/main My proof of concept uses xtensor, which is itself accelerated by xsimd (the same xsimd as linked by @rossbridger). I have currently achieved up to a 30x (edit: now up to 350x) speedup compared to GDScript itself; some performance is definitely lost by supporting variable data types, but I'm not sure the code is even fully SIMD-accelerated yet. It may be possible to get more out of it once I get around to checking whether everything is actually set up correctly for xsimd. It wasn't super complicated to bridge the interfaces into a GDExtension, so this approach might be an option going forward. I definitely support Godot adopting some SIMD capabilities where they most benefit speeds in the engine itself, though whether ndarrays should be supported in the core project is a harder sell. |
Describe the project you are working on:
High computation in gdscript.
Describe the problem or limitation you are having in your project:
While GDScript is great for many purposes, it sacrifices performance for portability and ease of use. I have come across this limitation in computationally intensive tasks before.
I came upon this thread, where people discuss the issue:
https://www.reddit.com/r/godot/comments/e71g7x/high_performance_code_in_gdscript/
Describe how this feature / enhancement will help you overcome this problem or limitation:
One of the fundamental changes that can increase the speed of code (in any language) is to move from a paradigm of functions that operate on a single item of data to one of functions that operate on multiple data. As well as encouraging a more optimal layout of the data, this often allows compilers to better optimize the procedure. It also offers more opportunities for autovectorization (generation of SSE and NEON SIMD code), which can offer significant speedups.
To distinguish actual SIMD CPU instructions from the paradigm of processing multiple data per function call, I will call the latter 'ranged' functions.
Show a mock up screenshots/video or a flow diagram explaining how your proposal will work:
I therefore made a quick mockup module in C++ yesterday, testing an equivalent function against 'normal' GDScript. The functions take the form:
function_name(arguments, range_from, range_to)
The timings on my PC give:
timing ranged 7ms
timing array 1953ms
This is a 279x speed increase. (It is not exactly like-for-like, as the module is built with -O3 and Godot is built with release_debug, but you get the idea.)
Describe implementation detail for your proposal (in code), if possible:
I have made a proof of concept module:
https://github.com/lawnjelly/godot-simd
Some of the most commonly used data structures in Godot are the float Vector2, Vector3, and Vector4 (the latter currently only available as a Quat / Plane etc. rather than an explicit vec4). These map reasonably onto the 4-value 32-bit float SIMD registers available on nearly all modern computers, so I built the proof of concept to work with this unit, which corresponds to __m128 in SSE.
With a Vector3 this does represent a potential 'waste' of the 4th float; however, this is probably outweighed by the speed gains from alignment. The 4th float is also useful for returning a result; I have used it to store lengths, squared lengths and dot products, for example.
Although some of the specific functions I have added rely on this arrangement, you can also use an SoA (structure of arrays) layout with this functionality if desired (see the sketch below).
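A minimal sketch of the padded 4-float unit described above, using SSE1 intrinsics (available on all x86_64; names are illustrative, not the module's actual API): each padded Vector3 occupies a full __m128, so one instruction processes x, y and z (plus the spare 4th float) at once.

```
#include <xmmintrin.h>

struct alignas(16) Vec4f32 {
    float v[4]; // x, y, z + one spare float (length, dot result, ...)
};

void add_ranged(Vec4f32 *dst, const Vec4f32 *a, const Vec4f32 *b,
                int range_from, int range_to) {
    for (int i = range_from; i < range_to; ++i) {
        __m128 ma = _mm_load_ps(a[i].v);  // aligned load, hence alignas(16)
        __m128 mb = _mm_load_ps(b[i].v);
        _mm_store_ps(dst[i].v, _mm_add_ps(ma, mb));
    }
}
```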
Some notes
I am also not proposing that this should be a finished product; I am open to ideas on how best to add this type of functionality, down to naming conventions etc. willnationsdev pointed out the difficulty of adding new types to Variant, so I've added a new object type derived from Reference, which seems to work. There may be better ways of passing some of the arguments.
One important consideration may be how best to get data in and out of the system to actually do something with the result; this is a common theme with SIMD. As such, I have added proof of concept functions to fill the fast array from a PoolVector3 and return the result to a PoolVector3 (see the sketch below).
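A hypothetical sketch of that fill step (names assumed, not the module's actual API): packing tightly stored Vector3 data (12 bytes each) into the padded 16-byte units before the ranged work runs; the reverse copy extracts the result afterwards.

```
struct Vec3 { float x, y, z; };      // tightly packed, 12 bytes
struct Vec4f32 { float v[4]; };      // padded SIMD-friendly unit, 16 bytes

void fill_from_vec3(Vec4f32 *dst, const Vec3 *src, int count) {
    for (int i = 0; i < count; ++i) {
        dst[i].v[0] = src[i].x;
        dst[i].v[1] = src[i].y;
        dst[i].v[2] = src[i].z;
        dst[i].v[3] = 0.0f; // spare float, free to hold a result later
    }
}
```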
If this enhancement will not be used often, can it be worked around with a few lines of script?:
Cannot be worked around with script.
Is there a reason why this should be core and not an add-on in the asset library?:
I'll probably round this out as a module even if it is decided not to put something like this in core. However, for something so simple, I think it offers a lot of potential, both to people making games and tools, and potentially for use in other parts of the core.
Functions already included
More timings (for those interested to compare intrinsics)
This time with a PoolVector3Array for GDScript, to make things fairer, comparing sqrt functions over 20,000,000 sqrts, using normal GDScript, ranged non-SSE code, and ranged SSE1 code:
timing gdscript 8228ms
timing ranged 86ms
timing SSE 8ms
This is over a 1000x speed increase with SSE. I have been looking at the potential for SSE at the same time as doing the ranged functions. Up to SSE2 can be used with no need for CPU detection on 64-bit x86, as it is mandated. With CPU detection we could use AVX-512, which would theoretically be up to 4000x faster (although memory speed might become more of a bottleneck).
Things like sqrt and reciprocal sqrt, length calculations and normalization get a lot faster, because SIMD offers a slightly less accurate but much faster version than standard sqrt. Afaik these can only be accessed via intrinsics; I don't think you can get the autovectorizer to emit them for you, as there is a small loss of accuracy.
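A sketch of the fast approximate reciprocal square root mentioned above: _mm_rsqrt_ps (SSE1) trades accuracy (roughly 12 bits of precision) for speed, and because its result differs from 1.0f / sqrtf(x), the autovectorizer will not normally substitute it for you; it has to be requested explicitly via intrinsics.

```
#include <xmmintrin.h>

// Computes dst[i] ~= 1 / sqrt(src[i]), four values per instruction.
// Assumes count is a multiple of 4 and both pointers are 16-byte aligned.
void rsqrt_ranged(float *dst, const float *src, int count) {
    for (int i = 0; i < count; i += 4) {
        _mm_store_ps(dst + i, _mm_rsqrt_ps(_mm_load_ps(src + i)));
    }
}
```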