AArch64: Improve arraycopy inlining #7318

Merged
merged 1 commit into from
May 15, 2024

@knn-k knn-k commented Apr 25, 2024

This commit improves the inlined code of arraycopy for AArch64, by using ldp/stp instructions with a pair of SIMD registers, which can load/store 32 bytes at a time.
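To illustrate what a pair load/store with post-indexed addressing does, here is a minimal Python model. It is only a sketch: memory is a `bytearray`, and the helper names `ldp_post`, `stp_post`, and `copy_forward_pairs` are hypothetical and are not part of OMR or of this commit.

```python
Q_BYTES = 16  # size of one 128-bit SIMD register (q0, q1, ...)


def ldp_post(mem, addr, imm):
    """Model of `ldp qA, qB, [xN], #imm`.

    Loads two adjacent 16-byte values starting at addr, then returns
    the base address advanced by imm (post-index writeback).
    """
    qa = bytes(mem[addr:addr + Q_BYTES])
    qb = bytes(mem[addr + Q_BYTES:addr + 2 * Q_BYTES])
    return qa, qb, addr + imm


def stp_post(mem, addr, qa, qb, imm):
    """Model of `stp qA, qB, [xN], #imm`.

    Stores two 16-byte values at addr and addr + 16, then returns
    the base address advanced by imm.
    """
    mem[addr:addr + Q_BYTES] = qa
    mem[addr + Q_BYTES:addr + 2 * Q_BYTES] = qb
    return addr + imm


def copy_forward_pairs(mem, src, dst, pairs):
    """Copy pairs * 32 bytes forward, one ldp/stp pair per step."""
    for _ in range(pairs):
        q0, q1, src = ldp_post(mem, src, 32)
        dst = stp_post(mem, dst, q0, q1, 32)
    return src, dst


# Usage: copy 64 bytes from offset 0 to offset 64 in two steps.
mem = bytearray(range(64)) + bytearray(64)
final_src, final_dst = copy_forward_pairs(mem, 0, 64, 2)
```

Each modeled step moves 32 bytes, twice what a single `ldr`/`str` of one q register moves.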


Signed-off-by: KONNO Kazuhiro <konno@jp.ibm.com>

knn-k commented Apr 25, 2024

Jenkins build aarch64,amac


knn-k commented Apr 25, 2024

See the sample code generated for copying 290 bytes forward.
In this code, x1 is the src address and x2 is the dst address. q0 and q1 are 128-bit SIMD registers. The new implementation needs fewer loop iterations (4 -> 2) and two fewer instructions (17 -> 15).

Existing implementation:

	mov	x0, #4	// loop count
loop:
	// Copy 16x4 = 64 bytes per iteration
	ldr	q0, [x1], #16
	str	q0, [x2], #16
	ldr	q0, [x1], #16
	str	q0, [x2], #16
	ldr	q0, [x1], #16
	str	q0, [x2], #16
	ldr	q0, [x1], #16
	str	q0, [x2], #16
	sub	x0, x0, #1
	cbnz	x0, loop
	// Copy remaining 34 bytes
	ldr	q0, [x1, #0]
	str	q0, [x2, #0]
	ldr	q0, [x1, #16]
	str	q0, [x2, #16]
	ldrh	w3, [x1, #32]
	strh	w3, [x2, #32]

New implementation:

	mov	x0, #2	// loop count
loop:
	// Copy 32x4 = 128 bytes per iteration
	ldp	q0, q1, [x1], #32
	stp	q0, q1, [x2], #32
	ldp	q0, q1, [x1], #32
	stp	q0, q1, [x2], #32
	ldp	q0, q1, [x1], #32
	stp	q0, q1, [x2], #32
	ldp	q0, q1, [x1], #32
	sub	x0, x0, #1
	stp	q0, q1, [x2], #32
	cbnz	x0, loop
	// Copy remaining 34 bytes
	ldp	q0, q1, [x1], #32
	stp	q0, q1, [x2], #32
	ldrh	w0, [x1, #0]
	strh	w0, [x2, #0]
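
The arithmetic behind the two listings can be checked with a small Python sketch. It assumes a 4x unrolled loop and a tail built greedily from the largest available access size, which matches the two samples above but is not necessarily how the OMR code generator decides in general. The function names `plan_copy` and `static_instructions` are hypothetical.

```python
def plan_copy(length, unit_bytes, unroll=4):
    """Split a constant-length forward copy into a loop and a tail.

    unit_bytes is the size moved by one load/store in the loop:
    16 for ldr/str of a q register, 32 for ldp/stp of a q pair.
    Returns (loop_count, tail_sizes).
    """
    per_iteration = unit_bytes * unroll
    loop_count = length // per_iteration
    remaining = length % per_iteration

    # Assumed tail policy: use the largest access size that still fits.
    sizes = sorted({unit_bytes, 16, 8, 4, 2, 1}, reverse=True)
    tail = []
    for size in sizes:
        while remaining >= size:
            tail.append(size)
            remaining -= size
    return loop_count, tail


def static_instructions(loop_count, tail, unroll=4):
    """Count emitted (not executed) instructions for the plan."""
    count = 0
    if loop_count > 0:
        count += 1           # mov  (loop counter setup)
        count += 2 * unroll  # load/store pairs in the loop body
        count += 2           # sub + cbnz
    count += 2 * len(tail)   # one load and one store per tail chunk
    return count


old_plan = plan_copy(290, 16)  # ldr/str with one q register
new_plan = plan_copy(290, 32)  # ldp/stp with a pair of q registers
old_count = static_instructions(*old_plan)
new_count = static_instructions(*new_plan)
```

For 290 bytes this gives 4 iterations plus a 16 + 16 + 2 byte tail for the existing code, and 2 iterations plus a 32 + 2 byte tail for the new code, that is, 17 versus 15 emitted instructions.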


knn-k commented Apr 25, 2024

See #7181 for the failure on x86 macOS.
I ran OpenJ9 sanity.functional and sanity.openjdk on AArch64 Linux and macOS with this change, and the jobs finished successfully.

@knn-k knn-k marked this pull request as ready for review April 25, 2024 23:07
@knn-k knn-k requested a review from 0xdaryl as a code owner April 25, 2024 23:07

knn-k commented May 6, 2024

@0xdaryl Please review.

@0xdaryl 0xdaryl merged commit 57a22b2 into eclipse:master May 15, 2024
6 of 8 checks passed
@knn-k knn-k deleted the aarch64acInlining branch May 15, 2024 23:36