Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kernel: add optimized versions of PROD_INTOBJS #1047

Merged
merged 1 commit into from
Jan 8, 2017

Conversation

fingolfin
Copy link
Member

@fingolfin fingolfin commented Jan 5, 2017

Add three variants of PROD_INTOBJS using portable GCC extensions such as
__builtin_smul_overflow (supported by GCC >= 5.0 and clang >= 3.4)

I took great care to verify via test programs written in that the new
implementations yields identical results to PROD_INTOBJS, at least on
64bit systems (and I see no reason why 32bit should be different).

Regarding performance: in a C micro benchmark, the new code outperforms
the old by a factor of 2 on my laptop. In GAP, the gain is not as
impressive, probably because other things add a lot of overhead. Still,
there is a gain of about 10%-15% with this micro benchmark:

foo:=function(b)
local r,i,j,x;
r:=[-b..b];;
for i in r do
for j in r do
x:=i*j;
od;
od;
return x;
end;

Old code:
gap> foo(2^12);;time;
1480
gap> foo(2^13);;time;
5559

New code:
gap> foo(2^12);;time;
1186
gap> foo(2^13);;time;
4849

For comparison, the C mini benchmark does something very similar, but is
faster by a factor 18 (for inputs 2^12, 2^13) to 40 (for input 2^14).

Disassembling the code generated by GCC and clang for the new
implementation shows that it is very close what I'd write by hand, while
the old code is a very complicated, and requires divisions in some cases
(= slow).

Please make sure that this pull request:

  • is submitted to the correct branch (the stable branch is only for bugfixes)
  • contains an accurate description of changes for the release notes below
  • provides new tests or relies on existing ones
  • correctly refers to other issues and related pull requests

Tick all what applies to this pull request

  • Adds new features
  • Improves and extends functionality
  • Fixes bugs that could lead to crashes
  • Fixes bugs that could lead to incorrect results
  • Fixes bugs that could lead to break loops

Write below the description of changes (for the release notes)

@fingolfin fingolfin added the kind: enhancement Label for issues suggesting enhancements; and for pull requests implementing enhancements label Jan 5, 2017
@codecov-io
Copy link

codecov-io commented Jan 5, 2017

Current coverage is 49.72% (diff: 100%)

Merging #1047 into master will increase coverage by 0.06%

@@             master      #1047   diff @@
==========================================
  Files           424        424          
  Lines        223275     223191    -84   
  Methods        3429       3429          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits         110880     110986   +106   
+ Misses       112395     112205   -190   
  Partials          0          0          

Powered by Codecov. Last update fd9d4e3...f96b61d

@ChrisJefferson
Copy link
Contributor

This looks fine to me. I'll count on your changes to the autoconf being right, they look "right enough" to me. Will just see if anyone else has a comment.

@fingolfin
Copy link
Member Author

I just slightly adjusted the autoconf code, to make it more flexible, as I want to use it in the future to check for other builtins which require a different number of arguments.

And yeah, I am pretty sure they are "right enough" ;-).

Add three variants of PROD_INTOBJS using portable GCC extensions such as
__builtin_smul_overflow (supported by GCC >= 5.0 and clang >= 3.4)

I took great care to verify via test programs written in that the new
implementations yields identical results to PROD_INTOBJS, at least on
64bit systems (and I see no reason why 32bit should be different).

Regarding performance: in a C micro benchmark, the new code outperforms
the old by a factor of 2 on my laptop. In GAP, the gain is not as
impressive, probably because other things add a lot of overhead. Still,
there is a gain of about 10%-15% with this micro benchmark:

foo:=function(b)
  local r,i,j,x;
  r:=[-b..b];;
  for i in r do
    for j in r do
      x:=i*j;
    od;
  od;
  return x;
end;

Old code:
gap> foo(2^12);;time;
1480
gap> foo(2^13);;time;
5559

New code:
gap> foo(2^12);;time;
1186
gap> foo(2^13);;time;
4849

For comparison, the C mini benchmark does something very similar, but is
faster by a factor 18 (for inputs 2^12, 2^13) to 40 (for input 2^14).

Disassembling the code generated by GCC and clang for the new
implementation shows that it is very close what I'd write by hand, while
the old code is a very complicated, and requires divisions in some cases
(= slow).
@ChrisJefferson ChrisJefferson merged commit a170b87 into gap-system:master Jan 8, 2017
@fingolfin
Copy link
Member Author

Wouldn't a rebase have been nice for this? This, together with the merge of #960, now makes for a somewhat confusing history again. Alternatively, if you prefer a merge commit, then I'd have been happy to rebase this first, before the merge, to make for simpler history.

@ChrisJefferson
Copy link
Contributor

I just forget all the options of github.

@fingolfin fingolfin deleted the mh/PROD_INTOBJS branch January 13, 2017 22:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind: enhancement Label for issues suggesting enhancements; and for pull requests implementing enhancements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants