Comparing polynomial multiplication routines

Each of the two inputs is a pair of exponents and coefficients.  The
exponents are necessarily single-word integers.  The coefficients are
potentially arbitrary precision, but for these tests are set to numbers
less than 2^16, so that none of the examples requires an actual bignum.
Each routine returns such a pair, representing the product.

  D  dense multiplication [our program: mul-dense]
  F  fast Fourier transform (FFTW double-float), subject to roundoff [mulfft3]
  K  Karatsuba (divide and conquer) [polymultk]

K was doing so badly that we decided to give it a head start and did not
charge it for converting its inputs to an array of double-floats.  Even
so, it didn't do well.

Results.

For small polynomial problems, e.g. dense 10-term by 50-term, or even
50-term by 50-term, D is best, typically by 5X or more.  The rest of the
results are grouped according to the sum of the number of terms in the
inputs.  For instance, the first section indicates which program is best
when multiplying polynomials with a total of about 100 terms.

spacing [1,1], meaning ``dense'' -- the space between adjacent exponents
is in the range [1,1].
  10x90  D (2.5X faster)
  20x80  D
  30x70  F
  40x60  F
  50x50  F

What if about half the coefficients are zero?  (spacing randomly in [1,2])
  10x90  D (3X faster)
  20x80  D
  30x70  D
  40x60  F (close)
  50x50  D (close)

spacing [1,4]
  10x90  D (2.5X faster)
  20x80  D
  30x70  D
  40x60  D
  50x50  D

spacing [1,8]
  10x90  D (5X faster)
  20x80  D
  30x70  D
  40x60  D
  50x50  D (2X faster)

In this section the two inputs have a total of about 200 terms.

spacing [1,1]
  50x150   K (! surprise: 1.4X faster than F, 3.3X faster than D)
  100x100  K (! surprise: 1.4X faster than F, 3.2X faster than D)

spacing [1,2]
  50x150   F
  100x100  F

spacing [1,4]
  50x150   D
  100x100  D

spacing [1,8]
  50x150   D
  100x100  D

spacing [1,16]
  50x150   D (3X faster)
  100x100  D (2.4X faster)

spacing [1,32]
  50x150   D (4X faster)
  100x100  D (3.5X faster)

For larger problems, the advantage for the FFT is considerable.

spacing [1,1]
  100x900  F (5X faster)
  200x800  F (9X faster)
  300x700  F (10X faster)
  400x600  F (15X faster)
  500x500  F (15X faster)

Even if the polynomials are slightly sparse:

spacing [1,2]
  100x900  F (3X faster)
  200x800  F (5X faster)
  300x700  F (7X faster)
  400x600  F (9X faster)
  500x500  F (9X faster)

spacing [1,4]
  100x900  F (2X faster)
  ...

spacing [1,8]
  100x900  F (almost a tie)
  200x800  F (2X faster)
  ...

The advantage disappears for polynomials that are yet sparser, with
exponent gaps averaging 8.5.

spacing [1,16]
  100x900  D (2X faster)
  200x800  D (almost a tie)
  300x700  F (1.2X faster)
  400x600  F (1.3X faster)
  500x500  F (1.5X faster)

For very large dense problems the FFT advantage is enormous, though it
starts to pale if the problems are sparse.

  [1,1]   1000x2000  F (300X faster)
  [1,2]   1000x2000  F (15X faster)
  [1,4]   1000x2000  F (15X faster)
  [1,32]  1000x2000  F (3X faster)

Sometimes Karatsuba beats D, but usually in those cases F beats
Karatsuba.  For small cases Karatsuba simply calls D, so it can never
beat D there.  Here are two additional Karatsuba winners we found:

spacing [1,1]
  100x200  K (almost a tie with F)
  50x350   K (almost a tie with F)

and, as an example, on 5000x5000 K beats D by 4X, but F beats D by 25X.

What about a naive list-based program (N)?  It competes effectively, and
handily beats the FFT when the data become really sparse.  Thus

  [1,512]  50x50  N (1.5X faster than D, 31X faster than FFT)

This is multiplying a polynomial of degree 12,649 by one of degree
13,982.  (Simplified sketches of the D, K, and N routines follow.)
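As a concrete illustration of the representation and of routine D, here
is a minimal Common Lisp sketch, assuming a polynomial is a cons of two
parallel vectors: exponents in increasing order, then coefficients.
The names poly->dense, dense->poly, and mul-dense-sketch are ours for
this note; the actual mul-dense is more careful about declarations and
storage.

  ;; A polynomial such as x^5 + 3x^2 + 7 is (#(0 2 5) . #(7 3 1)).

  (defun poly->dense (p)
    ;; Spread a sparse exponent/coefficient pair into a flat array
    ;; indexed by exponent.
    (let* ((exps (car p))
           (cofs (cdr p))
           (a (make-array (1+ (aref exps (1- (length exps))))
                          :initial-element 0)))
      (dotimes (i (length exps) a)
        (setf (aref a (aref exps i)) (aref cofs i)))))

  (defun dense->poly (a)
    ;; Collect the non-zero entries back into the pair form.
    (let ((exps '()) (cofs '()))
      (dotimes (i (length a))
        (unless (zerop (aref a i))
          (push i exps)
          (push (aref a i) cofs)))
      (cons (coerce (nreverse exps) 'vector)
            (coerce (nreverse cofs) 'vector))))

  (defun mul-dense-sketch (p q)
    ;; Classical O(n*m) coefficient-by-coefficient product.
    (let* ((a (poly->dense p))
           (b (poly->dense q))
           (c (make-array (1- (+ (length a) (length b)))
                          :initial-element 0)))
      (dotimes (i (length a))
        (dotimes (j (length b))
          (incf (aref c (+ i j)) (* (aref a i) (aref b j)))))
      (dense->poly c)))

For example, (mul-dense-sketch '(#(0 2) . #(7 3)) '(#(1) . #(2)))
returns (#(1 3) . #(14 6)), i.e. (3x^2+7)(2x) = 6x^3 + 14x.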
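Routine K follows the usual Karatsuba recurrence.  polymultk itself is
not reproduced here; this sketch (our names, and integer coefficients
rather than the double-float arrays K was actually given) shows the
split, the three recursive products, and the cutoff below which it just
calls the classical product, which is why K can never beat D on small
inputs.

  (defparameter *karat-cutoff* 16)

  (defun classical-mul (a b)
    ;; Classical product on dense coefficient vectors.
    (let ((c (make-array (1- (+ (length a) (length b)))
                         :initial-element 0)))
      (dotimes (i (length a) c)
        (dotimes (j (length b))
          (incf (aref c (+ i j)) (* (aref a i) (aref b j)))))))

  (defun vec+ (x y)
    ;; Elementwise sum, padding the shorter vector with zeros.
    (let ((s (make-array (max (length x) (length y)) :initial-element 0)))
      (dotimes (i (length x)) (incf (aref s i) (aref x i)))
      (dotimes (i (length y) s) (incf (aref s i) (aref y i)))))

  (defun karatsuba (a b)
    ;; a, b: dense coefficient vectors.  Returns their product, dense.
    (if (or (<= (length a) *karat-cutoff*) (<= (length b) *karat-cutoff*))
        (classical-mul a b)                    ; small: fall back to D
        (let* ((m  (floor (min (length a) (length b)) 2))
               (a0 (subseq a 0 m)) (a1 (subseq a m))
               (b0 (subseq b 0 m)) (b1 (subseq b m))
               (z0 (karatsuba a0 b0))                      ; low * low
               (z2 (karatsuba a1 b1))                      ; high * high
               (z1 (karatsuba (vec+ a0 a1) (vec+ b0 b1)))  ; cross terms
               (c  (make-array (1- (+ (length a) (length b)))
                               :initial-element 0)))
          ;; c = z0 + x^m (z1 - z0 - z2) + x^(2m) z2
          (dotimes (i (length z0)) (incf (aref c i) (aref z0 i)))
          (dotimes (i (length z1)) (incf (aref c (+ i m)) (aref z1 i)))
          (dotimes (i (length z0)) (decf (aref c (+ i m)) (aref z0 i)))
          (dotimes (i (length z2)) (decf (aref c (+ i m)) (aref z2 i)))
          (dotimes (i (length z2)) (incf (aref c (+ i (* 2 m))) (aref z2 i)))
          c)))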
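The naive list-based N is close to what one writes first: each term of
one input is distributed over the other and merged into an accumulating
sorted list, so a very sparse answer never touches a big flat array.  A
sketch, assuming terms are (exponent . coefficient) conses in
increasing exponent order (again, our names and layout):

  (defun add-terms (p q)
    ;; Merge two sorted term lists, combining equal exponents.
    (cond ((null p) q)
          ((null q) p)
          ((< (caar p) (caar q))
           (cons (car p) (add-terms (cdr p) q)))
          ((> (caar p) (caar q))
           (cons (car q) (add-terms p (cdr q))))
          (t (let ((c (+ (cdar p) (cdar q))))
               (if (zerop c)                 ; drop terms that cancel
                   (add-terms (cdr p) (cdr q))
                   (cons (cons (caar p) c)
                         (add-terms (cdr p) (cdr q))))))))

  (defun mul-naive (p q)
    ;; Distribute each term of p over q; merge the partial products.
    (let ((acc '()))
      (dolist (term p acc)
        (setf acc
              (add-terms acc
                         (mapcar (lambda (tq)
                                   (cons (+ (car term) (car tq))
                                         (* (cdr term) (cdr tq))))
                                 q))))))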
What about a heap-based program (H)?

We never got a correct implementation that incorporated all of the
Monagan/Pearce improvements, so the method may be better than we
indicate here.  Yes, it does appear superior, but only for a class of
extremely sparse cases.  In our testing, it first becomes superior to
all the alternatives here:

  [1,512]    500x500  H (1.1X faster than D, 5X faster than F)
             (polynomials of degree 125,000)
  [1,1024]   500x500  H (1.1X faster than D, 8.5X faster than F)
             (polynomials of degree 250,000)
  [1,16384]  500x500  H (3.6X faster than D)
             (polynomials of degree 4.1 million)

Somewhat surprisingly, D is still competitive.  (A sketch of a basic
heap-based product appears at the end of this note.)

More notes on the FFT.

In our tests here we have already given the FFT an advantage, in that
we ignore the possibility of overflow, roundoff error, or other
problems in converting to floats and back to exact integers; D and K
automatically take care of this.  Even so, it is worth exploring how we
could give the FFT still more benefits; perhaps in a particular
application these are justified.

1. Provide an array of double-floats for each input, so that the
   exponent/coefficient processing to load a sparse representation into
   an array is unnecessary.

2. Allow the answer to be an array of double-floats, without rounding
   to the nearest integer and without converting coefficients that
   round to zero into actual zeros.

3. Provide slightly more benefit by pre-computing "optimal" FFT "plans"
   for each anticipated transform size in advance of its use; tools for
   this are provided by the FFTW library.  In our limited testing, we
   found that plans based on what the FFTW programmers call "estimates"
   were quick and adequate.  At one point we did compute more elaborate
   plans, but ultimately left this out, since the plans interacted with
   memory buffer locations in an inconvenient manner and the speedups
   were only in the range of 0 to 20 percent.

We found that benefits 1 and 2 reduced the time for the F algorithm by
a factor of between 3.6 and 5.2.  The larger number was for the sparse
case, in which the FFT program was returning a substantially
unattractive result: a dense array most of whose coefficients were
merely close to zero.

There is also the issue of the FFT's finite precision: what is the cost
if the FFT implementation is used as the basis for a modular, exact,
multi-word answer?  (If an alternative "bignum" FFT is used, it would
probably be a factor of 10 or more slower than the double-float FFT,
probably making it unattractive.  A finite-field FFTW might be
possible.)  (A toy sketch of the F pipeline also appears below.)
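As promised, here is a sketch of a basic heap-based product in the
spirit of Monagan and Pearce, but without their chaining and
exponent-packing refinements (which, as noted above, our own attempt
also lacked; all names here are ours).  A binary min-heap keyed on
exponent sum carries one live (i,j) candidate per term of the first
input, so the heap stays small and the product emerges in increasing
exponent order with no dense intermediate array.

  (defun heap-push (heap item)
    ;; heap: vector with fill pointer; item: (key i j).  Sift up.
    (vector-push-extend item heap)
    (do ((k (1- (fill-pointer heap))))
        ((zerop k))
      (let ((parent (floor (1- k) 2)))
        (if (< (car item) (car (aref heap parent)))
            (progn (rotatef (aref heap k) (aref heap parent))
                   (setq k parent))
            (return)))))

  (defun heap-pop (heap)
    ;; Remove and return the item with the smallest key.  Sift down.
    (let ((top  (aref heap 0))
          (tail (vector-pop heap))
          (n    (fill-pointer heap)))
      (when (plusp n)
        (setf (aref heap 0) tail)
        (do ((k 0)) (nil)
          (let ((s k) (l (+ (* 2 k) 1)) (r (+ (* 2 k) 2)))
            (when (and (< l n) (< (car (aref heap l)) (car (aref heap s))))
              (setq s l))
            (when (and (< r n) (< (car (aref heap r)) (car (aref heap s))))
              (setq s r))
            (when (= s k) (return))
            (rotatef (aref heap k) (aref heap s))
            (setq k s))))
      top))

  (defun mul-heap-sketch (ea ca eb cb)
    ;; ea/ca and eb/cb: exponent and coefficient vectors, exponents
    ;; increasing.  Returns the product in the same two-vector form.
    (let ((heap (make-array 8 :adjustable t :fill-pointer 0))
          (exps '())
          (cofs '()))
      (dotimes (i (length ea))          ; seed one candidate per row i
        (heap-push heap (list (+ (aref ea i) (aref eb 0)) i 0)))
      (loop while (plusp (fill-pointer heap)) do
        (destructuring-bind (e i j) (heap-pop heap)
          (let ((c (* (aref ca i) (aref cb j))))
            (if (and exps (= (car exps) e))
                (incf (car cofs) c)     ; exponents pop in order, so
                (progn (push e exps)    ; equal ones are adjacent
                       (push c cofs))))
          (when (< (1+ j) (length eb))  ; advance row i to its next term
            (heap-push heap
                       (list (+ (aref ea i) (aref eb (1+ j))) i (1+ j))))))
      ;; (A real implementation would also drop terms that cancel to 0.)
      (cons (coerce (nreverse exps) 'vector)
            (coerce (nreverse cofs) 'vector))))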
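Finally, the shape of the F pipeline.  mulfft3 calls FFTW through a
foreign-function interface, which we do not reproduce; to keep this
sketch self-contained it substitutes a toy recursive radix-2 transform.
It shows the same steps (load into double-float arrays, transform,
multiply pointwise, inverse transform, round back to integers) and the
same roundoff exposure, at nothing like FFTW's speed.

  (defun interleave (v start)
    ;; Every other element of v, beginning at index start (0 or 1).
    (let ((out (make-array (floor (length v) 2))))
      (dotimes (i (length out) out)
        (setf (aref out i) (aref v (+ start (* 2 i)))))))

  (defun toy-fft (v &optional inverse)
    ;; v: vector of (complex double-float); length must be a power of 2.
    ;; Unscaled transform; the caller divides by n after the inverse.
    (let ((n (length v)))
      (if (= n 1)
          v
          (let ((evens (toy-fft (interleave v 0) inverse))
                (odds  (toy-fft (interleave v 1) inverse))
                (out   (make-array n))
                (sign  (if inverse 1 -1)))
            (dotimes (k (/ n 2) out)
              (let ((w (* (aref odds k) (cis (/ (* sign 2 pi k) n)))))
                (setf (aref out k) (+ (aref evens k) w)
                      (aref out (+ k (/ n 2))) (- (aref evens k) w))))))))

  (defun mul-fft-sketch (a b)
    ;; a, b: dense integer coefficient vectors.  Rounding the convolution
    ;; back to integers is where accuracy can be lost for large inputs.
    (let* ((m  (1- (+ (length a) (length b))))     ; product length
           (n  (ash 1 (integer-length (1- m))))    ; power of 2 >= m
           (fa (make-array n :initial-element #c(0d0 0d0)))
           (fb (make-array n :initial-element #c(0d0 0d0))))
      (dotimes (i (length a))
        (setf (aref fa i) (complex (float (aref a i) 1d0))))
      (dotimes (i (length b))
        (setf (aref fb i) (complex (float (aref b i) 1d0))))
      (let* ((tc  (map 'vector #'* (toy-fft fa) (toy-fft fb)))
             (c   (toy-fft tc t))
             (out (make-array m)))
        (dotimes (i m out)
          (setf (aref out i) (round (realpart (aref c i)) n))))))

For example, (mul-fft-sketch #(7 0 3) #(1 2)) yields #(7 14 3 6), i.e.
(3x^2+7)(2x+1) = 6x^3 + 3x^2 + 14x + 7.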