fast PowerSeries_poly multiplication #10480

sagetrac-pernici · 2010-12-16T10:02:56Z

In this patch truncated multiplication of dense polynomials is
used in PowerSeries_poly multiplication.

in Sage-4.6 on my computer with Intel Core i7 2.8GHz

sage: R.<a,b> = QQ[]
sage: K.<t> = PowerSeriesRing(R)
sage: time p1 = (1 + a*t + b*t^2 + O(t^50))^-40
Wall time: 7.62s

with this patch it takes 0.12s
The speed-up increases with the number of variables and with the precision.

Apply: trac_10480_fast_PowerSeries_poly_multiplication-alternative.patch

Depends on #10255

CC: @nilesjohnson @zimmermann6

Component: commutative algebra

Keywords: power series

Author: Mario Pernici, Luis Felipe Tabera Alonso

Branch/Commit: public/ticket/10480 @ e2cdc54

Issue created by migration from https://trac.sagemath.org/ticket/10480

The text was updated successfully, but these errors were encountered:

sagetrac-pernici · 2010-12-16T10:18:31Z

Attachment: trac_10480_fast_PowerSeries_poly_multiplication.patch.gz

lftabera · 2010-12-16T10:53:17Z

comment:1

Could you please check if the changes of #10255 in karatsuba multiplication provide further improvement?

PowerSeries are constructed over commutative rings, but polynomial_elements do not. Hence mul_trunc needs to work also over noncommutative rings. It does not right now (al least lines 5342 and 5345 do not differenciate between left and right multiplication).

I see several improvements:

I think that

cdef int i, j, a1len, a2len, n1, n2, ih

in line 5334 should be Py_ssize_t, but maybe a cython expert should look at this.

c = [0]*h in line 5346 of the patch is inefficient in several cases. One of the most contributions in 10480 is to avoid python 0 'int' in the code, that can be (relatively) expensive to coerce. I guess that this code will be slow if the base ring is a number field (that has slow coercion).

The last loop is classical multiplication up to the precision required. I wonder if part of that multiplication could use do_karatsuba or a truncated_do_karatsuba.

lftabera · 2010-12-17T20:35:48Z

comment:2

Some more numbers with your patch showing what I said on my previous post.
Patch aginst a clean sage 4.6

with patch:

sage: K.<t>=QQ['a'][]
sage: p1 = K.random_element(800) + O(t^990)
sage: p2 = K.random_element(800) + O(t^990)
sage: %time _=p1*p2
CPU times: user 6.09 s, sys: 0.08 s, total: 6.17 s
Wall time: 6.49 s

without patch

sage: sage: K.<t>=QQ['a'][]
sage: sage: p1 = K.random_element(800) + O(t^990)
sage: sage: p2 = K.random_element(800) + O(t^990)
sage: sage: %time _=p1*p2
CPU times: user 3.24 s, sys: 0.03 s, total: 3.27 s
Wall time: 3.39 s

I think that we can define an algorithm that works like karatsuba but that has a prec parameter. If a portion will have order greater than the prec. That part is discarded.
The prec is updated on recursive calls.

lftabera · 2010-12-21T22:51:05Z

comment:4

I will try to implement the algorithm at "A long note on Mulders’ short product" of Hanrot and Zimmermann that looks not harder than Karatsuba. The authors claim that heuristically the gain time is on the average 0.7 x Karatsuba time.

Also, the following cases have to be correctly covered:

sage: O(x)*O(x)
0
sage: O(x^2)*O(x^3)
0

sagetrac-pernici · 2010-12-22T22:17:09Z

Attachment: trac_10480_fast_PowerSeries_poly_multiplication2.patch.gz

Attachment: un.sage.gz

benchmark file

sagetrac-pernici · 2010-12-22T22:18:20Z

comment:5

Hello,

lftabera wrote:

I think that we can define an algorithm that works like karatsuba but that has a prec parameter. If a portion will have order greater than the prec. That part is discarded. The prec is updated on recursive calls.

I followed your suggestion to perform truncated multiplication recursively,
keeping track of the precision, calling Karatsuba multiplication
when there are no truncations.

Also, the following cases have to be correctly covered:

sage: O(x)*O(x)
0
sage: O(x^2)*O(x^3)
0

The new patch trac_10480_fast_PowerSeries_poly_multiplication2.patch passes all tests, included these.
It performs generally better than the unpached Sage4.6,
except the case in which the series has degree lower than the precision,
like your example

p1 = K.random_element(800) + O(t^990)

on which it is typically up to 50% slower, while with the previous
patch it was 2x slower.

The first patch was also slow in other cases, see below.

The performance of truncated multiplication vs Karatsuba multiplication
depends on the distribution of the terms in the univariate series;
truncated multiplication is fast when the high powers in the univariate
variable t have many more terms than the low powers, like in this example

using this patch:

sage: from sage.rings.power_series_poly import PowerSeries_poly
sage: R.<a,b> = QQ[]
sage: K.<t> = R[]
sage: p = (1 + a*t + b*t^2 + O(t^50))^-1
sage: v = [len(x.coefficients()) for x in p.coefficients()]
sage: v[:6], v[-6:]
([1, 1, 2, 2, 3, 3], [23, 23, 24, 24, 25, 25])
sage: %timeit p*p
25 loops, best of 3: 26 ms per loop
sage: %timeit PowerSeries_poly(p.parent(),p.polynomial().mul_trunc(p.polynomial(),50,1),50)
125 loops, best of 3: 4.89 ms per loop
sage: p*p - PowerSeries_poly(p.parent(),p.polynomial().mul_trunc(p.polynomial(),50,1),50)
O(t^50)

unpached Sage (main):

sage: %timeit p*p
5 loops, best of 3: 833 ms per loop

mul_trunc(self, right, h, generic=1) calls do_mul_trunc_generic, which
is the algorithm used in the first patch; in this case it is faster,
but in other cases it is slower, so I think it is better to use do_mul_trunc.
However in multivariate series do_mul_trunc_generic seems to perform always
better; I will post a patch about this in ticket #1956

p = K.random_element(50) + O(t^50)

has uniform distribution; for p*p the patched version performs as the main version.

In un.sage one takes N subsequent p=p*(p+1) starting with a
random univariate series on a polynomial ring with n variables;
the ratio of the number of terms in the upper half of the univariate series
and the lower half is given.
The ratio is close to 1 in a random polynomial, and increases with n and N.
There is a correlation between the speedup
using truncated multiplication and this ratio.
The old patch (patch1) in some cases performs badly; the worst case is given.

n variables in polynomial ring on QQ; h precision;
N = number of times p = p*(p+1) is taken
n=1
h=20  N=1     2    3     4    5    6   7
ratio  0.86  0.98  1     1    1    1   1
main   0.005 0.008 0.022 0.09 0.41 3.5 53.5
patch  0.004 0.006 0.016 0.06 0.31 2.5 41.5
h=100
ratio  0.93  1     1     1    1
main   0.049 0.14  0.47  1.6  7.9
patch  0.043 0.10  0.31  1.1  5.5
h=500
ratio  0.96  1.0
main   1.5   20.9
patch1 2.0   49.1
patch  1.3   17.0

n=2
h=20
ratio  0.81  1.06  1.10  1.19 1.34
main   0.009 0.04  0.30  4.3  134
patch  0.007 0.03  0.22  2.6  61
h=100
ratio  1.05  1.01  1.0   1.0
main   0.11  0.80  7.8   171
patch  0.09  0.59  5.5   103

n=4
h=20
ratio  1.08  1.53  1.64
main   0.015 0.35  21.0
patch  0.010 0.19  10.9

n=8
h=10
ratio  1.28  2.67  5.43
main   0.008 0.38  127
patch  0.006 0.046 4.4

n=16
h=10
ratio  1.00  2.77  8.90
main   0.013 1.95  > 19m(6GB)
patch  0.009 0.14  29.0(0.5GB)

In the last example I stopped the computation with main because too much
memory was used.

I will try to implement the algorithm at "A long note on Mulders’ short product" of Hanrot and Zimmermann that looks not harder than Karatsuba. The authors claim that heuristically the gain time is on the average 0.7 x Karatsuba time.

If I understand correctly, that paper deals with univariate series
on numeric rings. Maybe in the case of univariate series over polynomial
rings one could use the list of the number of elements to guide the
algorithm.

lftabera · 2010-12-23T15:41:32Z

comment:6

Oops, we have been duplicating work,

I have written a proof of concept of a truncated Karatsuba to check that it derives a correct algorithm. It is much much slower than plain sage. It needs to eliminate unnecesary slicing and calls of type K(f).list()

I will take a look at your patch.

My patch depends on #10255

lftabera · 2010-12-28T09:53:31Z

comment:7

Mario,

This part on your patch

zeros = [0] * e 
t0 = do_mul_trunc(b,d,h) 
t1a = do_mul_trunc(a,d,h-e) 
t1b = do_mul_trunc(b,c,h-e) 
t1 = zeros + _karatsuba_sum(t1a,t1b) 
t2 = zeros + zeros + do_mul_trunc(a,c,h-2*e) 
return _karatsuba_sum(t0,_karatsuba_sum(t1,t2))

Is inefficient, 2/3 of the operations in this part is adding a ring element with python zero. This was included in original do_karatsuba and was the reason to rewrite it in #10255.

I get the following numbers with the patches:

sage: R.<a,b> = QQ[]
sage: K.<t> = PowerSeriesRing(R)
sage: time p1 = (1 + a*t + b*t^2 + O(t^50))^-40

No patch: 8.92

multiplication2: 0.66

alternative: 0.45

sage: K.<x>=PowerSeriesRing(QQ[I])
sage: p1 = K.random_element(800)
sage: p2 = K.random_element(800)

No patch: 1.59

#10255: 0.24

multiplication2: 1.57

alternative2: 0.19

sage: R.<a,b> = QQ[]
sage: K.<t> = PowerSeriesRing(R)
sage: p1 = QQ[a,b][t].random_element(800) + O(t^800)
sage: p2 = QQ[a,b][t].random_element(800) + O(t^800)

No patch: 5.75

#10255: 5.25

multiplication2: 4.62

alternative2: 2.41

sagetrac-pernici · 2010-12-28T12:34:17Z

comment:8

You are right, your patch is faster on generic univariate series; I will look at it.

In the related ticket #1956 on multivariate series do_mul_trunc_generic is faster than Karatsuba even using your improved version, in the benchmarks I posted there.

lftabera · 2010-12-28T17:37:23Z

comment:9

In that case, one can play with set_karatsuba_threshold in the underlying univariate polynomial ring to try to find a better balance between karatsuba and classical multiplication.

nilesjohnson · 2010-12-29T16:04:57Z

comment:10

Replying to @lftabera:

In that case, one can play with set_karatsuba_threshold in the underlying univariate polynomial ring to try to find a better balance between karatsuba and classical multiplication.

Optimizing multivariate power series for the algorithms here is now #10532. Since I've never heard of the Karatsuba algorithm before, could someone here point me to a reference which would explain it well enough for me to know what the better balance is? (or just tell me :)

thanks,
Niles

p.s. thanks to those of you involved with this -- I think it's awesome to have these tailored algorithms in Sage!

lftabera · 2011-01-03T09:56:51Z

comment:11

Well, Karatsuba is just the first algorithm you can find on faster polynomial multiplication. There is no good balance that will work for every ring. For a fixed ring it even depends on the specific input as has been shown in Mario's examples.

virtually any book on computer algebra will mention Karatsuba method of multiplication. You could take a look, for example, to Joachim von zur Gathen, Jürgen Gerhard, "Modern Computer Algebra" were Karatsuba and Schönhage–Strassen methods are discussed.

sagetrac-pernici · 2011-01-09T17:32:45Z

comment:12

lftabera wrote:

In that case, one can play with set_karatsuba_threshold in the underlying univariate polynomial ring to try to find a better balance between karatsuba and classical multiplication.

Right, one can always use your Karatsuba code; in the case of multivariate
series it seems to me that the best value for K_threshold is infinity.
I have posted some benchmarks in #10532; I have no idea why,
but the case K_threshold =infinity and classical multiplication
perform the same for QQ and GF(7), while the former performs much better
than the latter in the case of RR; I guess that this might generalize to
the statement that the former performs much better than the latter in the
case of (a class of) fields not covered by libsingular.

I do not think that optimizing the code to avoid summing to Python 0
would change much the performance of do_mul_trunc,
since that part of the code takes little time.
I think that do_trunc_karatsuba is faster than do_mul_trunc
because in the former there is a clever split

left_even  = left[::2]; left_odd   = left[1::2]

corresponding to
separating the polynomial p(t) in p_even(t^2) and t*p_odd(t^2)
so that the precision prec in t is reduced to the precision n1=(prec+1)//2 in t^2

l = do_trunc_karatsuba(left_even, right_even, n1, K_threshold)

while in the latter b and d are the polynomials reduced to O(t^e),
with e close to prec//2, so that the product is done at full precision

t0 = do_mul_trunc(b,d,h), with h = prec.

The middle products in the two cases are comparable.

Maybe you could add a comment about the splitting in do_trunc_karatsuba?
Also there is in the docstring of do_generic_product the misprint bellow .

lftabera · 2011-01-09T19:59:41Z

comment:13

I will take a look and include your suggestions.

sagetrac-pernici · 2011-02-05T16:54:02Z

comment:14

In ticket #10720 I used truncated multiplication from this ticket to speedup
nth_root and inversion for series on polynomial rings. The speedup increases with
the number of variables.

lftabera · 2013-03-05T14:01:20Z

Dependencies: #10255

lftabera · 2013-03-05T14:01:20Z

comment:15

Previous implementation was not better than full multiplication. I have added a patch with original Mulder's short multiplication. If the cost of operations is similar for all coefficients of the power series, then the short multiplication compatible with karatsuba will be faster than the naive one.

Said this, for multivariate power series, typically higher terms are big polynomials with significant multiplication cost. That is why the generic multiplication is better in those cases.

Apply: trac_10480_fast_Powerseries_Mulder.patch

lftabera · 2013-03-05T14:09:47Z

Mulder algorithm

lftabera · 2013-03-05T14:10:33Z

comment:16

Attachment: trac_10480_fast_Powerseries_Mulder.patch.gz

Apply: trac_10480_fast_Powerseries_Mulder.patch

lftabera · 2013-03-06T12:13:48Z

comment:17

Use only Mulders method patch

Apply: trac_10480_fast_Powerseries_Mulder.patch

lftabera · 2013-05-15T08:52:04Z

Changed author from mario pernici to Mario Pernici, Luis Felipe Tabera Alonso

lftabera · 2013-05-15T08:52:04Z

comment:20

Attachment: trac_10480_fast_PowerSeries_poly_multiplication-alternative.patch.gz

I made some experiments and Mulder's algorithm is generally worse and when it is better, the gain is residual so I go back to my first proposal.

Note that the underlying series in #10532 are rather special, so this method as is is not the best for that case.

Apply: trac_10480_fast_PowerSeries_poly_multiplication-alternative.patch

rwst · 2014-04-22T09:53:42Z

comment:23

Have you looked at the implementation of Karatsuba and Schönhage-Strassen in GMP-ECM? Quote:

GMP-ECM features Schönhage-Strassen multiplication for polynomials in
stage 2 when factoring Fermat numbers (not in the new, fast stage 2 for
P+1 and P-1. This is to be implemented.) This greatly reduces the number of
modular multiplications required, thus improving speed. It does, however,
restrict the length of the polynomials to powers of two, so that for a given
number of blocks (-k parameter), the B2 value can only increase by factors of
approximately 4.
For the number of blocks, choices of 2, 3 or 4 usually give best performance.
However, if the polynomial degree becomes too large, relatively expensive
Karatsuba or Toom-Coom methods are required to split the polynomial before
Schönhage-Strassen's method can handle them. That can make a larger number
of blocks worthwhile. ...

zimmermann6 · 2014-04-22T11:21:29Z

comment:24

I'm new to this ticket, thus please forgive me if some comments are out of scope.

About Mulders' algorithm, using a fixed cutoff point k = 25/36*n is not optimal in
practice, where it is better to tune the best k for each n (up to say n=1024) as we do in GNU MPFR.

Since truncated (polynomial) multiplication is potentially useful in other parts of Sage, it would make sense to have the code implementing truncated multiplication and the one using it for power series in two different tickets, no?

Paul

fchapoton · 2014-11-02T17:52:18Z

comment:27

made a git branch on 6.4.rc1

New commits:

`af07389`	`#10480: fast PowerSeries_poly multiplication`
`69e21b9`	`trac #10480 doc and code formatting`

fchapoton · 2014-11-02T17:52:18Z

Branch: public/ticket/10480

fchapoton · 2014-11-02T17:52:18Z

Commit: 69e21b9

sagetrac-git · 2015-02-18T09:16:28Z

Changed commit from 69e21b9 to e2cdc54

sagetrac-git · 2015-02-18T09:16:28Z

Branch pushed to git repo; I updated commit sha1. New commits:

`e2cdc54`	`Merge branch 'public/ticket/10480' into 6.5`

fchapoton · 2015-02-23T19:41:34Z

comment:29

Branch does not apply

videlec · 2017-12-19T10:59:40Z

comment:30

We do have now truncated multiplication on polynomials

sage: p = ZZ['x']('x^20 + 3 * x^15 - 2*x + 1')
sage: p._mul_trunc_(p, 4)
4*x^2 - 4*x + 1

zimmermann6 · 2017-12-19T17:36:39Z

comment:31

however this does not seem faster than plain multiplication (here with Sage 8.0):

sage: q=2^1000000-1
sage: p=IntegerModRing(q)['x'].random_element(degree=8)
sage: time a=p*p
CPU times: user 473 ms, sys: 8.05 ms, total: 481 ms
Wall time: 481 ms
sage: time b=p._mul_trunc_(p,8)
CPU times: user 478 ms, sys: 3.93 ms, total: 482 ms
Wall time: 482 ms

videlec · 2017-12-19T20:14:50Z

comment:32

Replying to @zimmermann6:

however this does not seem faster than plain multiplication (here with Sage 8.0):

It does depend on the base ring

sage: p = ZZ['x'].random_element(degree=20)
sage: %time a = p*p
CPU times: user 31 µs, sys: 2 µs, total: 33 µs
Wall time: 35 µs
sage: %time b = p._mul_trunc_(p,8)
CPU times: user 23 µs, sys: 1 µs, total: 24 µs
Wall time: 27.9 µs

But you are right that the generic version of _mul_truc_ is nothing more than multiply and truncate.

sagetrac-pernici mannequin added this to the sage-4.6.1 milestone Dec 16, 2010

sagetrac-pernici mannequin added c: commutative algebra labels Dec 16, 2010

sagetrac-pernici mannequin assigned malb Dec 16, 2010

lftabera added the s: needs work label Dec 16, 2010

lftabera modified the milestones: sage-4.6.1, sage-4.6.2 Dec 16, 2010

lftabera added s: needs review and removed s: needs work labels Mar 5, 2013

This comment has been minimized.

Sign in to view

lftabera added s: needs review and removed s: needs work labels May 15, 2013

jdemeyer modified the milestones: sage-5.11, sage-5.12 Aug 13, 2013

sagetrac-vbraun-spam mannequin modified the milestones: sage-6.1, sage-6.2 Jan 30, 2014

sagetrac-vbraun-spam mannequin modified the milestones: sage-6.2, sage-6.3 May 6, 2014

sagetrac-vbraun-spam mannequin modified the milestones: sage-6.3, sage-6.4 Aug 10, 2014

fchapoton added s: needs work and removed s: needs review labels Feb 23, 2015

fchapoton modified the milestones: sage-6.4, sage-6.6 Feb 23, 2015

mkoeppe removed this from the sage-6.6 milestone Dec 29, 2022

nilesjohnson mannequin mentioned this issue Dec 29, 2010

Faster multiplication for multivariate power series #10532

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fast PowerSeries_poly multiplication #10480

fast PowerSeries_poly multiplication #10480

sagetrac-pernici mannequin commented Dec 16, 2010

sagetrac-pernici mannequin commented Dec 16, 2010

lftabera commented Dec 16, 2010

lftabera commented Dec 17, 2010

lftabera commented Dec 21, 2010

sagetrac-pernici mannequin commented Dec 22, 2010

sagetrac-pernici mannequin commented Dec 22, 2010

lftabera commented Dec 23, 2010

lftabera commented Dec 28, 2010

sagetrac-pernici mannequin commented Dec 28, 2010

lftabera commented Dec 28, 2010

nilesjohnson mannequin commented Dec 29, 2010

lftabera commented Jan 3, 2011

sagetrac-pernici mannequin commented Jan 9, 2011

lftabera commented Jan 9, 2011

sagetrac-pernici mannequin commented Feb 5, 2011

lftabera commented Mar 5, 2013

lftabera commented Mar 5, 2013

lftabera commented Mar 5, 2013

lftabera commented Mar 5, 2013

lftabera commented Mar 6, 2013

lftabera commented May 15, 2013

lftabera commented May 15, 2013

This comment has been minimized.

rwst commented Apr 22, 2014

zimmermann6 commented Apr 22, 2014

fchapoton commented Nov 2, 2014

fchapoton commented Nov 2, 2014

fchapoton commented Nov 2, 2014

sagetrac-git mannequin commented Feb 18, 2015

sagetrac-git mannequin commented Feb 18, 2015

fchapoton commented Feb 23, 2015

videlec commented Dec 19, 2017

zimmermann6 commented Dec 19, 2017

videlec commented Dec 19, 2017

fast PowerSeries_poly multiplication #10480

fast PowerSeries_poly multiplication #10480

Comments

sagetrac-pernici mannequin commented Dec 16, 2010

sagetrac-pernici mannequin commented Dec 16, 2010

lftabera commented Dec 16, 2010

lftabera commented Dec 17, 2010

lftabera commented Dec 21, 2010

sagetrac-pernici mannequin commented Dec 22, 2010

sagetrac-pernici mannequin commented Dec 22, 2010

lftabera commented Dec 23, 2010

lftabera commented Dec 28, 2010

sagetrac-pernici mannequin commented Dec 28, 2010

lftabera commented Dec 28, 2010

nilesjohnson mannequin commented Dec 29, 2010

lftabera commented Jan 3, 2011

sagetrac-pernici mannequin commented Jan 9, 2011

lftabera commented Jan 9, 2011

sagetrac-pernici mannequin commented Feb 5, 2011

lftabera commented Mar 5, 2013

lftabera commented Mar 5, 2013

lftabera commented Mar 5, 2013

lftabera commented Mar 5, 2013

lftabera commented Mar 6, 2013

lftabera commented May 15, 2013

lftabera commented May 15, 2013

This comment has been minimized.

rwst commented Apr 22, 2014

zimmermann6 commented Apr 22, 2014

fchapoton commented Nov 2, 2014

fchapoton commented Nov 2, 2014

fchapoton commented Nov 2, 2014

sagetrac-git mannequin commented Feb 18, 2015

sagetrac-git mannequin commented Feb 18, 2015

fchapoton commented Feb 23, 2015

videlec commented Dec 19, 2017

zimmermann6 commented Dec 19, 2017

videlec commented Dec 19, 2017