Preparing for v2.3 #170

manodeep · 2018-09-29T00:51:59Z

Will include avx512
Additional optimizations based on min. separations between cell-pairs
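
For readers unfamiliar with the min. separation idea, here is a minimal sketch in Python rather than the C used in the kernels. It assumes each cell carries an axis-aligned bounding box and ignores periodic wrapping. The names `min_sep_sq` and `can_skip_cell_pair` are hypothetical and are not taken from the Corrfunc source; the actual implementation may differ.

```python
def min_sep_sq(box_a, box_b):
    """Squared minimum distance between two axis-aligned boxes.

    Each box is ((xmin, ymin, zmin), (xmax, ymax, zmax)).
    Hypothetical helper for illustration; no periodic wrapping.
    """
    (lo_a, hi_a), (lo_b, hi_b) = box_a, box_b
    total = 0.0
    for la, ha, lb, hb in zip(lo_a, hi_a, lo_b, hi_b):
        # Gap along this axis; zero if the intervals overlap.
        gap = max(lb - ha, la - hb, 0.0)
        total += gap * gap
    return total


def can_skip_cell_pair(box_a, box_b, rmax):
    """True if no pair of points from the two cells can lie within rmax."""
    return min_sep_sq(box_a, box_b) > rmax * rmax
```

If the closest possible approach of two cells already exceeds the largest bin edge, the whole cell-pair can be dropped before any particle distances are computed.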

astropy-bot · 2018-09-29T00:52:02Z

Hi there @manodeep 👋 - thanks for the pull request! I'm just a friendly 🤖 that checks for issues related to the changelog and making sure that this pull request is milestoned and labeled correctly. This is mainly intended for the maintainers, so if you are not a maintainer you can ignore this, and a maintainer will let you know if any action is required on your part 😃.

Everything looks good from my point of view! 👍

If there are any issues with this message, please report them here.

manodeep · 2018-09-29T01:13:04Z

@lgarrison I will merge this PR in for now but we will need a proper review once this branch is ready to merge into master

…le positions, array of cell-pairs (#173) * WIP: Added the machinery for quicker exits * WIP: Adding in early exits to the theory routines * Break out of next j-loop if any dz values in current iteration are larger than 'max_dz' * [WIP] Started implementing the min sep optimizations for mocks * Preparing for v2.3 (#170) * started basic work on avx512 (completely wrong results currently) * Fixed the number of bits set in the mask. npairs now agrees with the test, rpavg and weightavg are still wrong * Trying mask loads. everything is broken now, including npairs * Working version with AVX512 (requires AVX512VL, ie Skylake cpus). Not been valgrind'ed but the test passes. Only compiles with icc right now * Compiles with gcc7.3 (but not with gcc6.4) - might be a compiler bug * AVX512 is awesome! Supports masked horizontal adds across vector registers * Cleaned up the logic in the initial part of the loop. * Improved handling of missing numpy. Renamed the fma macros to reflect that there are many FMA options available * Moved all the union declarations into the header files. Removed the mask horizontal add because that was only supported by intel compilers and composed of multiple separate intrinsics. Fixed the missing closing brace from c++ compilers in the sse42 header * Added the AVX2 implementation for wp. No performance improvement because I could not get the integer blends to work with gcc. So the AVX2 is really identical to the AVX implementation except for the fma involved * mostly finished with the avx512 * Cleaned up the dependency statements in the Makefiles. Protected the FMA calls for different instruction sets * Replaced the mask loads with maskz loads. Replaced the comparison within the histogram update loop to a faster bitwise operation. Changed the floating point operations to the quiet kinds from the signalling kinds. 
Added in some more masked operations to the avx512 header, and removed ops that do not carry over from avx * Should not compile python extensions if python/numpy are not available * Added the avx512f kernels for DDrppi_mocks. Integration tests pass. The speedup is not that impressive though * Bumped version * Silencing compiler warning about meaningless type qualifier * Replaced the Newton-Raphson steps with FMA equivalents * Fixed a logic error for the AVX512 case with count_vectorized option. Removed extraneous comment symbols that were causing compile failures * Updated the scripts for benchmarking the speedup from avx512f * Added the DD kernel. Integration tests pass * Added the xi kernel. Integration tests pass * Added the avx512f kernel for DDrppi. Integration tests pass * Replaced the set zero to the actual intrinsic dedicated to that purpose * Fixed the setzero int call * Added the DDsmu pair counter. Not real speed improvement; tests pass * Cleaned up the initial search for valid pairs in wp * Added the DDsmu_mocks (compiles but no checks have been run) * Added the DDtheta kernel (not compiled, tested or debugged) * fast divide options needs to be added to DDsmu theory at some point [ci skip] * Fixed bug in DDtheta. Integration tests pass now * Removed unused variable. Tests pass for DDsmu_mocks * Adding in the fast_divide option to theory/DDsmu paircounter. Not tested * Fixing the typos in fast-divide part of DDsmu * Fixed typo. And ported the nicer way of figuring out the start index for the second set of points to the AVX kernel * Removed icc as default compiler * Attempting to fix travis failures from AVX512 code * Removing the old xcode6.4 - seems to be not supported on travis any more * No longer counting blank lines. Partly fixes #160 * Thetamin of 0.0 is allowed * Fixing the build failure in #168 * Adding in the fast_divide option to the kernels. 
hopefully fixing build failure * Fixing build failure for numpy dependency * Made a changelog entry [ci skip] * conda uses secure channels now * Corrected the PR # [ci skip] * Added in the PR # to the min separations [ci skip] * Fixing build failure * Added in the min sep into the avx512 kernels * Added the avx512 intrinsic for computing the NOT of a mask * Added the avx512 kernel for vpf * Added in the boolean option for min_sep_opt to theory pair-counters * Noted that avx512f is now available * Changed docstrings for avx512f * Added the min_sep_opt keyword handling to the C extension for python * Integration tests pass for min. sep optimizations for theory routines. * Integration tests pass for mocks * Added the min-sep-opt handling to the python extension and the wrappers * Ignoring the files with output from integration tests [ci skip] * Added in the enable_min_sep option to the extensions * Fix benchmark scripts for Python 3. Make Python output redirection not break if output has already been redirected. * Added an option to generate SIMD speedup from numpart/rmax scalings * Added the speedup tables * Added in min-sep based optimizations for xi * Option to turn off min_sep_opt for paper tests * Added in a bounds to the lattice to compute min separations * Fix min-sep optimization in xi kernel; needs porting to other kernels * Added in the min_sep_optimizations for xi * Removing zmin pointer since code seems to run slower. Tests pass * Added the z1_min pointer back in * Removed zmin again since the tests are 10\% slower * Added in min_sep_optmizations for wp. Integration tests pass * Added in the updated gridlinking code for 2D separations * Fixing the doctest failure (white-space issues) * Undoing (mostly) the last commit * Should fully undo the whitespace fix that broke the build * Adding a continue with max_dz condition check * Added in min-sep-opt for DD. Integration tests pass * Added min sep opt for DDrppi. 
Integration tests pass * Adding DDsmu min-sep-opt and gridlink fixes. Untested * Attempting to fix the bounding box calculations. * Tried to make the variable naming conventions clearer. Plus, continue statements in simd modes * Fixing compile failure * Fixing (integration) test failure in xi * Removed the negative pimax check * Corrected the logic for updating min distance between cell-pairs back to the original implementation * Moved the assignment after all the bounding box checks * Added min-sep-opt for mocks DDrppi * Fixing compile failure * Fixed (inconsequential) typos [ci skip] * Added min-sep-opt for DDsmu mocks. Integration tests pass * Added an integration test by default * Trying to fix (travis) compile failure for integration tests * Splitting up the make tests into two stages * Fixed whatspace errors * Forgot to pass through the min-sep-opt option to the python extension * Perhaps a missing whitespace before semi-colon * Updated the min_dz calculation * Added the min_dz update to the mocks * Fixed #161 * Changed the variable type to int64_t for the rmin==0.0 bin counts correction. Integration tests PASS for all mocks and all theory routines * Bumped version * Copied from the numpy setup.py file. Deleting the Corrfunc setup env var * Trying to fix travis failure for integration tests * Added the PR into the changelog [ci skip] * The warning was failing the build * Fixed the compile failure with gcc * The bug-fix should only be for gcc <= gcc8 and not for other compilers * The integration tests exceed time-limit on travis - removing * Only print the missing openmp support for Apple clang once * Dropped xcode7.3 from travis - seems to be causing build failure (due to wurlitzer) * Only print the openmp warning once * WIP: Adding min-sep-opt for wtheta (based on chord separation) * WIP: Min-sep-opt for DDtheta mocks * Removed cz from the mocks lattice structure * WIP: Tests fail for DDtheta * Fix weights corruption when using brute-force DDtheta kernel. 
Now appears to fail npairs by one particle when linking in RA. * Improved the error message under test failure * WIP: Possibly a working solution for DDtheta. Integration tests not done yet * Changing the avg_np calculation to include both datasets * Fixed the DDtheta bug * Propagated the -max_dz search across all relevant pair-counters * Only print the boosting bin-ref message in verbose mode * Should only print the info message in verbose mode * Fixed bug in xi * Removed unnecessary if condition * Had missed adding the double-counting check in the brute force wtheta (#161) * DDtheta brute-force is always a cross-corr (#161) * Added a few missing error messages about malloc failures * Added in the low-32 bits multiplication for avx512 * Only printing the clang-openmp message once * Use int instead of int8 for max_nmesh. Fixes #179 (grid refinement bug). * WIP: Option to use the particle positions in-place and returning an array of cell-pairs * WIP: Implementations for the theory paircounters * Adding the python bindings and tests for theory paircounters * Added in the new config options for the python wrappers * Attempting to fix build failure and pep8 issues * Fixed docstrings and updated changelog * Fixing docstrings [ci skip] * Fixed another docstring issue [ci skip] * More docstring fixes * Docstring fixes plus added big/little endian fix to DDsmu * Renamed copy_particle_positions to copy_particles (since both positions and weights are copied) * Reordered the sequence when running integration tests * Header for the new cell-pair struct, only one theory cellarray struct now (rather than two) * Initializing the xmin/xmax etc variables to floating limits * Created a new file containing gridlink utilities * Removed the reorder option completely. Fixed memory leaks during integration tests * Added the copy_particles option for mocks. 
Passing sqr_smax/sqr_smin into the DDsmu kernels instead of smax/smin * Added the copy_particles (and min-sep-opt for DDtheta) into mocks python extension * Fixed up the docs in the python theory extensions * A fix for automatic cache linesize detection (not used currently) * Added the copy_particles to the mocks python wrappers * Forgot to pass on the copy_particles value into the python extensions * Free the memory only when tests are successful * Fixing build failure * Removed duplicate function prototype * Fixed one set of memory leaks * More fixes for memory leaks * Changed the particle numbers in the tests to be 64-bit integers * Added the sanitize options if running on travis/other CI * Added utility functions to find the min and max separation between two cells * Fixed compile failure * Commenting out the fsanitize options to fix 'unknown compiler option' compile failure * Making copy_particles be the default * setting copy_particles=True as the default in theory * Fixed memory leaks with ra-dec linking and DDtheta * Add in the sanitize flags when running integration tests * Fix automatic uniform weight arrays and broadcasting of scalar weights. Also fixes weights reference leak. Closes #180 and #181. * Update changelog [ci skip] * * Fix doctest backwards compatibility with old Numpy print formatting * Edit Travis config to only build PRs and master branch (eliminates duplicate builds in PRs) * Fix indentation in utils.py * Update changelog Trying to appease astropy-bot * Another attempt at changelog * Fix changelog formatting? [ci skip] * Fix RST parser warnings in change log [ci skip]

manodeep added 30 commits March 22, 2018 18:29

started basic work on avx512 (completely wrong results currently)

c4056a7

Fixed the number of bits set in the mask. npairs now agrees with the test, rpavg and weightavg are still wrong

ce978f4

Trying mask loads. everything is broken now, including npairs

ccf47e6

Working version with AVX512 (requires AVX512VL, ie Skylake cpus). Not been valgrind'ed but the test passes. Only compiles with icc right now

638f9fe

Compiles with gcc7.3 (but not with gcc6.4) - might be a compiler bug

904715a

AVX512 is awesome! Supports masked horizontal adds across vector registers

55ce36a

Cleaned up the logic in the initial part of the loop.

48e4f6b

Improved handling of missing numpy. Renamed the fma macros to reflect that there are many FMA options available

b0f0b81

Moved all the union declarations into the header files. Removed the mask horizontal add because that was only supported by intel compilers and composed of multiple separate intrinsics. Fixed the missing closing brace from c++ compilers in the sse42 header

8a9784c

Added the AVX2 implementation for wp. No performance improvement because I could not get the integer blends to work with gcc. So the AVX2 is really identical to the AVX implementation except for the fma involved

0e3122e

mostly finished with the avx512

eb71b3a

Cleaned up the dependency statements in the Makefiles. Protected the FMA calls for different instruction sets

350d97a

Should not compile python extensions if python/numpy are not available

661af64

Added the avx512f kernels for DDrppi_mocks. Integration tests pass. The speedup is not that impressive though

4386083

Bumped version

e0a0daf

Silencing compiler warning about meaningless type qualifier

d65be85

Replaced the Newton-Raphson steps with FMA equivalents

ae76711

Fixed a logic error for the AVX512 case with count_vectorized option. Removed extraneous comment symbols that were causing compile failures

dea1d18

Updated the scripts for benchmarking the speedup from avx512f

c1323ff

Added the DD kernel. Integration tests pass

bedfd16

Added the xi kernel. Integration tests pass

6b58e17

Added the avx512f kernel for DDrppi. Integration tests pass

76a5784

Replaced the set zero to the actual intrinsic dedicated to that purpose

5dc64d0

Fixed the setzero int call

d8e83f3

Added the DDsmu pair counter. Not real speed improvement; tests pass

ff926cf

Cleaned up the initial search for valid pairs in wp

e30cee9

Added the DDsmu_mocks (compiles but no checks have been run)

cb7a073

Added the DDtheta kernel (not compiled, tested or debugged)

561bfeb

fast divide options needs to be added to DDsmu theory at some point [ci skip]

a566f43

manodeep added 17 commits April 29, 2018 22:16

Merge branch 'avx512' of https://github.com/manodeep/Corrfunc into avx512

30696ba

Fixed bug in DDtheta. Integration tests pass now

16470c2

Removed unused variable. Tests pass for DDsmu_mocks

d3be28d

Adding in the fast_divide option to theory/DDsmu paircounter. Not tested

c741056

Fixing the typos in fast-divide part of DDsmu

00c0ac5

Fixed typo. And ported the nicer way of figuring out the start index for the second set of points to the AVX kernel

4a13dba

Removed icc as default compiler

2498523

Attempting to fix travis failures from AVX512 code

0573375

Removing the old xcode6.4 - seems to be not supported on travis any more

51559b2

No longer counting blank lines. Partly fixes #160

285791d

Thetamin of 0.0 is allowed

f3f95b4

Merge branch 'master' into avx512

91ce6ba

Fixing the build failure in #168

a28cdb7

Adding in the fast_divide option to the kernels. hopefully fixing build failure

82d6ad9

Fixing build failure for numpy dependency

52572a1

Made a changelog entry [ci skip]

b871725

conda uses secure channels now

db9c19c

Merge branch 'min_sep_optimizations' into avx512

6ad6106

manodeep added this to the v2.3.0 milestone Sep 29, 2018

manodeep mentioned this pull request Sep 29, 2018

avx512 kernels #167

Closed

manodeep requested a review from lgarrison September 29, 2018 00:57

manodeep added 2 commits September 29, 2018 11:06

Corrected the PR # [ci skip]

5ea7f49

Added in the PR # to the min separations [ci skip]

4463e52

manodeep merged commit 52069c9 into min_sep_optimizations Sep 29, 2018

manodeep deleted the avx512 branch September 29, 2018 01:13
