Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

common: apply two stage copy to aarch64 #3145

Merged
merged 1 commit into from
Jun 9, 2022
Merged

Conversation

JunHe77
Copy link
Contributor

@JunHe77 JunHe77 commented May 26, 2022

On aarch64 ZSTD_wildcopy uses a simple loop to do
16B based memory copy. There is existing optimized
two stage copy that can achieve better performance.
By applying this to aarch64 it is also observed ~1%
uplift in silesia corpus.

Signed-off-by: Jun He jun.he@arm.com
Change-Id: Ic1253308e7a8a7df2d08963ba544e086c81ce8be

On aarch64 ZSTD_wildcopy uses a simple loop to do
16B based memory copy. There is existing optimized
two stage copy that can achieve better performance.
By applying this to aarch64 it is also observed ~1%
uplift in silesia corpus.

Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: Ic1253308e7a8a7df2d08963ba544e086c81ce8be
@embg embg self-assigned this Jun 1, 2022
@Cyan4973
Copy link
Contributor

Cyan4973 commented Jun 2, 2022

By aarch64, could you provide some details about the exact type of cpu / soc this patch has been benchmarked on ?

1 similar comment
@Cyan4973

This comment was marked as duplicate.

@terrelln terrelln self-assigned this Jun 2, 2022
@JunHe77
Copy link
Contributor Author

JunHe77 commented Jun 3, 2022

Hi, @Cyan4973 , the result have been benchmarked on the Arm N1/A72/A57 platforms and observed similar uplift.

@Cyan4973
Copy link
Contributor

Cyan4973 commented Jun 6, 2022

I can't remember why this code was added here.

It could be that, with aarch64 being merely an instruction set under which so many different SoCs are built, maybe some of them (non N1) would prefer the first loop.
However, it's pretty hard to confirm / test (couldn't find a test platform where this holds true).

From what I can see, the second formulation just separates the first branch from later ones, so that it can have its own statistics (as opposed to being merged with other loop iterations). Such a construction is expected to be rather good in the context of wildcopy, essentially distinguishing small copies from larger ones. This should translate into almost always better performance, except maybe for specific systems with rather poor branch predictors.

So I'm gonna make an educated guess here and state that this PR seems tends to improve the situation, on top of simplifying it by removing a weird and poorly documented corner case.

1 similar comment
@Cyan4973

This comment was marked as duplicate.

@JunHe77
Copy link
Contributor Author

JunHe77 commented Jun 7, 2022

Thank you @Cyan4973 for the review. I used to check the log info of that change (969ba4f), but could not find the context of it. It looks like this is designed for compression. With my benchmarks here, I didn't find regression in compression with removing the aarch64 specific part on N1/A72/A57.

@embg embg removed their assignment Jun 9, 2022
@terrelln
Copy link
Contributor

terrelln commented Jun 9, 2022

Thanks for the PR @JunHe77!

@terrelln terrelln merged commit 3b915cd into facebook:dev Jun 9, 2022
@Cyan4973 Cyan4973 mentioned this pull request Feb 9, 2023
@JunHe77 JunHe77 deleted the wildcopy branch March 12, 2023 07:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants