Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: arm64 performance optimizations #4288

Merged
merged 8 commits into from
Dec 12, 2022
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions .gitmodules
Original file line number Diff line number Diff line change
Expand Up @@ -19,3 +19,6 @@
[submodule "ext_libs/vcpkg"]
path = ext_libs/vcpkg
url = ../../microsoft/vcpkg.git
[submodule "ext_libs/sse2neon"]
path = ext_libs/sse2neon
url = https://github.com/DLTcollab/sse2neon
7 changes: 6 additions & 1 deletion cmake/VWFlags.cmake
Original file line number Diff line number Diff line change
Expand Up @@ -18,9 +18,14 @@ if("${CMAKE_SYSTEM_PROCESSOR}" STREQUAL "x86_64")
endif()
endif()

set(LINUX_ARM64_OPT_FLAGS "")
if("${CMAKE_SYSTEM_PROCESSOR}" MATCHES "aarch64|arm64|ARM64")
set(LINUX_ARM64_OPT_FLAGS -mcpu=neoverse-n1)
endif()
Copy link
Member

@jackgerrits jackgerrits Dec 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is specific to this CPU, so we should not be adding it whenever we encounter an arm CPU. Generally for specific optimization flags like this it is better to add them when configuring your own build.

So, in this case we should remove this but in your build you would add the following to your command line to achieve the same outcome:

-DCMAKE_CXX_FLAGS_RELEASE="-mcpu=neoverse-n1"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to go ahead and push a change removing this piece so we can move forward here.


# Add -ffast-math for speed, remove for testability.
# no-stack-check is added to mitigate stack alignment issue on Catalina where there is a bug with aligning stack-check instructions, and stack-check became default option
set(LINUX_RELEASE_CONFIG -fno-strict-aliasing ${LINUX_X86_64_OPT_FLAGS} -fno-stack-check -fomit-frame-pointer)
set(LINUX_RELEASE_CONFIG -fno-strict-aliasing ${LINUX_X86_64_OPT_FLAGS} ${LINUX_ARM64_OPT_FLAGS} -fno-stack-check -fomit-frame-pointer)
set(LINUX_DEBUG_CONFIG -fno-stack-check)

#Use default visiblity on UNIX otherwise a lot of the C++ symbols end up for exported and interpose'able
Expand Down
1 change: 1 addition & 0 deletions ext_libs/sse2neon
Submodule sse2neon added at 270cf6
13 changes: 12 additions & 1 deletion vowpalwabbit/core/src/reductions/lda_core.cc
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,10 @@ VW_WARNING_STATE_POP
#include "vw/core/vw_versions.h"
#include "vw/io/logger.h"

#if defined(__ARM_NEON)
#include <sse2neon/sse2neon.h>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has not been made available to the vw_core cmake target, so the build won't be able to find this. We also don't have any CI which is going to cause this path to be exercised so we need to be careful.

The include should be added here:

target_include_directories(vw_core PRIVATE ${CMAKE_CURRENT_LIST_DIR}/src)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just created a package in vcpkg for sse2neon microsoft/vcpkg#28129

Would it be okay if I add the header directly in the source code?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is still preferable to use the submodule. I can push a commit to the branch with this change,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can push a commit to the branch with this change

That would be nice of you

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackgerrits FYI, I added sse2neon to vcpkg. Would you prefer that I add the package instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding that, that's super helpful! There was a tiny issue with the installed path, I went ahead and opened a fix here microsoft/vcpkg#28268. When that gets merged I will update this PR to consume.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll go ahead and merge this one and we can update to consume vcpkg later.

#endif

#include <algorithm>
#include <cassert>
#include <cmath>
Expand Down Expand Up @@ -164,7 +168,7 @@ inline float fastdigamma(float x)

#if !defined(VW_NO_INLINE_SIMD)

# if defined(__SSE2__) || defined(__SSE3__) || defined(__SSE4_1__)
# if defined(__SSE2__) || defined(__SSE3__) || defined(__SSE4_1__) || defined(__ARM_NEON)

namespace
{
Expand All @@ -186,6 +190,13 @@ inline bool is_aligned16(void* ptr)
# include <smmintrin.h>
# endif

// Transport SSE intrinsics through sse2neon on ARM:
#if defined(__ARM_NEON)
#define __SSE2__ 1
#define __SSE3__ 1
#define __SSE4_1__ 1
#endif

# define HAVE_SIMD_MATHMODE

typedef __m128 v4sf;
Expand Down