AVX512F kernels, min separation optimizations, not duplicating particle positions, array of cell-pairs (#173)

* WIP: Added the machinery for quicker exits
* WIP: Adding in early exits to the theory routines
* Break out of next j-loop if any dz values in current iteration are larger than 'max_dz'
* [WIP] Started implementing the min sep optimizations for mocks
* Preparing for v2.3 (#170)
* started basic work on avx512 (completely wrong results currently)
* Fixed the number of bits set in the mask. npairs now agrees with the test, rpavg and weightavg are still wrong
* Trying mask loads. everything is broken now, including npairs
* Working version with AVX512 (requires AVX512VL, i.e. Skylake CPUs). Not been valgrind'ed but the test passes. Only compiles with icc right now
* Compiles with gcc7.3 (but not with gcc6.4) - might be a compiler bug
* AVX512 is awesome! Supports masked horizontal adds across vector registers
* Cleaned up the logic in the initial part of the loop.
* Improved handling of missing numpy. Renamed the fma macros to reflect that there are many FMA options available
* Moved all the union declarations into the header files. Removed the mask horizontal add because that was only supported by intel compilers and composed of multiple separate intrinsics. Fixed the missing closing brace from c++ compilers in the sse42 header
* Added the AVX2 implementation for wp. No performance improvement because I could not get the integer blends to work with gcc. So the AVX2 is really identical to the AVX implementation except for the fma involved
* mostly finished with the avx512
* Cleaned up the dependency statements in the Makefiles. Protected the FMA calls for different instruction sets
* Replaced the mask loads with maskz loads. Replaced the comparison within the histogram update loop with a faster bitwise operation. Changed the floating point operations to the quiet kinds from the signalling kinds.
  Added in some more masked operations to the avx512 header, and removed ops that do not carry over from avx
* Should not compile python extensions if python/numpy are not available
* Added the avx512f kernels for DDrppi_mocks. Integration tests pass. The speedup is not that impressive though
* Bumped version
* Silencing compiler warning about meaningless type qualifier
* Replaced the Newton-Raphson steps with FMA equivalents
* Fixed a logic error for the AVX512 case with count_vectorized option. Removed extraneous comment symbols that were causing compile failures
* Updated the scripts for benchmarking the speedup from avx512f
* Added the DD kernel. Integration tests pass
* Added the xi kernel. Integration tests pass
* Added the avx512f kernel for DDrppi. Integration tests pass
* Replaced the set zero with the actual intrinsic dedicated to that purpose
* Fixed the setzero int call
* Added the DDsmu pair counter. No real speed improvement; tests pass
* Cleaned up the initial search for valid pairs in wp
* Added the DDsmu_mocks (compiles but no checks have been run)
* Added the DDtheta kernel (not compiled, tested or debugged)
* fast divide option needs to be added to DDsmu theory at some point [ci skip]
* Fixed bug in DDtheta. Integration tests pass now
* Removed unused variable. Tests pass for DDsmu_mocks
* Adding in the fast_divide option to theory/DDsmu paircounter. Not tested
* Fixing the typos in fast-divide part of DDsmu
* Fixed typo. And ported the nicer way of figuring out the start index for the second set of points to the AVX kernel
* Removed icc as default compiler
* Attempting to fix travis failures from AVX512 code
* Removing the old xcode6.4 - seems to be not supported on travis any more
* No longer counting blank lines. Partly fixes #160
* Thetamin of 0.0 is allowed
* Fixing the build failure in #168
* Adding in the fast_divide option to the kernels.
  hopefully fixing build failure
* Fixing build failure for numpy dependency
* Made a changelog entry [ci skip]
* conda uses secure channels now
* Corrected the PR # [ci skip]
* Added in the PR # to the min separations [ci skip]
* Fixing build failure
* Added in the min sep into the avx512 kernels
* Added the avx512 intrinsic for computing the NOT of a mask
* Added the avx512 kernel for vpf
* Added in the boolean option for min_sep_opt to theory pair-counters
* Noted that avx512f is now available
* Changed docstrings for avx512f
* Added the min_sep_opt keyword handling to the C extension for python
* Integration tests pass for min. sep optimizations for theory routines.
* Integration tests pass for mocks
* Added the min-sep-opt handling to the python extension and the wrappers
* Ignoring the files with output from integration tests [ci skip]
* Added in the enable_min_sep option to the extensions
* Fix benchmark scripts for Python 3. Make Python output redirection not break if output has already been redirected.
* Added an option to generate SIMD speedup from numpart/rmax scalings
* Added the speedup tables
* Added in min-sep based optimizations for xi
* Option to turn off min_sep_opt for paper tests
* Added in bounds to the lattice to compute min separations
* Fix min-sep optimization in xi kernel; needs porting to other kernels
* Added in the min_sep_optimizations for xi
* Removing zmin pointer since code seems to run slower. Tests pass
* Added the z1_min pointer back in
* Removed zmin again since the tests are 10% slower
* Added in min_sep_optimizations for wp. Integration tests pass
* Added in the updated gridlinking code for 2D separations
* Fixing the doctest failure (white-space issues)
* Undoing (mostly) the last commit
* Should fully undo the whitespace fix that broke the build
* Adding a continue with max_dz condition check
* Added in min-sep-opt for DD. Integration tests pass
* Added min sep opt for DDrppi.
  Integration tests pass
* Adding DDsmu min-sep-opt and gridlink fixes. Untested
* Attempting to fix the bounding box calculations.
* Tried to make the variable naming conventions clearer. Plus, continue statements in simd modes
* Fixing compile failure
* Fixing (integration) test failure in xi
* Removed the negative pimax check
* Corrected the logic for updating min distance between cell-pairs back to the original implementation
* Moved the assignment after all the bounding box checks
* Added min-sep-opt for mocks DDrppi
* Fixing compile failure
* Fixed (inconsequential) typos [ci skip]
* Added min-sep-opt for DDsmu mocks. Integration tests pass
* Added an integration test by default
* Trying to fix (travis) compile failure for integration tests
* Splitting up the make tests into two stages
* Fixed whitespace errors
* Forgot to pass through the min-sep-opt option to the python extension
* Perhaps a missing whitespace before semi-colon
* Updated the min_dz calculation
* Added the min_dz update to the mocks
* Fixed #161
* Changed the variable type to int64_t for the rmin==0.0 bin counts correction. Integration tests PASS for all mocks and all theory routines
* Bumped version
* Copied from the numpy setup.py file. Deleting the Corrfunc setup env var
* Trying to fix travis failure for integration tests
* Added the PR into the changelog [ci skip]
* The warning was failing the build
* Fixed the compile failure with gcc
* The bug-fix should only be for gcc <= gcc8 and not for other compilers
* The integration tests exceed time-limit on travis - removing
* Only print the missing openmp support for Apple clang once
* Dropped xcode7.3 from travis - seems to be causing build failure (due to wurlitzer)
* Only print the openmp warning once
* WIP: Adding min-sep-opt for wtheta (based on chord separation)
* WIP: Min-sep-opt for DDtheta mocks
* Removed cz from the mocks lattice structure
* WIP: Tests fail for DDtheta
* Fix weights corruption when using brute-force DDtheta kernel.
  Now appears to fail npairs by one particle when linking in RA.
* Improved the error message under test failure
* WIP: Possibly a working solution for DDtheta. Integration tests not done yet
* Changing the avg_np calculation to include both datasets
* Fixed the DDtheta bug
* Propagated the -max_dz search across all relevant pair-counters
* Only print the boosting bin-ref message in verbose mode
* Should only print the info message in verbose mode
* Fixed bug in xi
* Removed unnecessary if condition
* Had missed adding the double-counting check in the brute force wtheta (#161)
* DDtheta brute-force is always a cross-corr (#161)
* Added a few missing error messages about malloc failures
* Added in the low-32 bits multiplication for avx512
* Only printing the clang-openmp message once
* Use int instead of int8 for max_nmesh. Fixes #179 (grid refinement bug).
* WIP: Option to use the particle positions in-place and returning an array of cell-pairs
* WIP: Implementations for the theory paircounters
* Adding the python bindings and tests for theory paircounters
* Added in the new config options for the python wrappers
* Attempting to fix build failure and pep8 issues
* Fixed docstrings and updated changelog
* Fixing docstrings [ci skip]
* Fixed another docstring issue [ci skip]
* More docstring fixes
* Docstring fixes plus added big/little endian fix to DDsmu
* Renamed copy_particle_positions to copy_particles (since both positions and weights are copied)
* Reordered the sequence when running integration tests
* Header for the new cell-pair struct, only one theory cellarray struct now (rather than two)
* Initializing the xmin/xmax etc variables to floating limits
* Created a new file containing gridlink utilities
* Removed the reorder option completely. Fixed memory leaks during integration tests
* Added the copy_particles option for mocks.
  Passing sqr_smax/sqr_smin into the DDsmu kernels instead of smax/smin
* Added the copy_particles (and min-sep-opt for DDtheta) into mocks python extension
* Fixed up the docs in the python theory extensions
* A fix for automatic cache linesize detection (not used currently)
* Added the copy_particles to the mocks python wrappers
* Forgot to pass on the copy_particles value into the python extensions
* Free the memory only when tests are successful
* Fixing build failure
* Removed duplicate function prototype
* Fixed one set of memory leaks
* More fixes for memory leaks
* Changed the particle numbers in the tests to be 64-bit integers
* Added the sanitize options if running on travis/other CI
* Added utility functions to find the min and max separation between two cells
* Fixed compile failure
* Commenting out the fsanitize options to fix 'unknown compiler option' compile failure
* Making copy_particles be the default
* setting copy_particles=True as the default in theory
* Fixed memory leaks with ra-dec linking and DDtheta
* Add in the sanitize flags when running integration tests
* Fix automatic uniform weight arrays and broadcasting of scalar weights. Also fixes weights reference leak. Closes #180 and #181.
* Update changelog [ci skip]
* Fix doctest backwards compatibility with old Numpy print formatting
* Edit Travis config to only build PRs and master branch (eliminates duplicate builds in PRs)
* Fix indentation in utils.py
* Update changelog
  Trying to appease astropy-bot
* Another attempt at changelog
* Fix changelog formatting? [ci skip]
* Fix RST parser warnings in change log [ci skip]
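
The log above mentions two pruning ideas: skipping cell-pairs whose minimum possible separation is already too large ("min sep" optimizations), and leaving the inner j-loop early once dz exceeds 'max_dz'. The following is a minimal Python sketch of both ideas, not the C kernels from this commit. The function names `min_cell_separation` and `count_pairs` are hypothetical, and the sketch assumes non-periodic axis-aligned cells, a second cell sorted by z, and max_dz equal to rmax.

```python
import math


def min_cell_separation(lo1, hi1, lo2, hi2):
    """Smallest possible distance between a point in box 1 and a point in box 2.

    Each box is axis-aligned and described by its lower and upper corners.
    Assumption for this sketch: no periodic wrapping.
    """
    sqr_sep = 0.0
    for a_lo, a_hi, b_lo, b_hi in zip(lo1, hi1, lo2, hi2):
        # Gap between the two intervals on this axis; zero when they overlap.
        gap = max(b_lo - a_hi, a_lo - b_hi, 0.0)
        sqr_sep += gap * gap
    return math.sqrt(sqr_sep)


def count_pairs(cell1, cell2, rmax):
    """Count pairs with separation < rmax. cell2 must be sorted by z."""
    if not cell1 or not cell2:
        return 0

    # Bounding boxes of the particles actually present in each cell.
    lo1 = [min(p[k] for p in cell1) for k in range(3)]
    hi1 = [max(p[k] for p in cell1) for k in range(3)]
    lo2 = [min(p[k] for p in cell2) for k in range(3)]
    hi2 = [max(p[k] for p in cell2) for k in range(3)]

    # Cell-level pruning: no pair can be closer than the box separation.
    if min_cell_separation(lo1, hi1, lo2, hi2) >= rmax:
        return 0

    sqr_rmax = rmax * rmax
    max_dz = rmax  # assumption: the z cut equals the 3D search radius
    npairs = 0
    for x1, y1, z1 in cell1:
        for x2, y2, z2 in cell2:
            dz = z2 - z1
            if dz <= -max_dz:
                continue  # too far below; later (higher-z) points may still match
            if dz >= max_dz:
                break  # sorted by z, so every remaining point is farther away
            dx = x2 - x1
            dy = y2 - y1
            if dx * dx + dy * dy + dz * dz < sqr_rmax:
                npairs += 1
    return npairs
```

Sorting the second cell by z is what makes the `break` safe; without the sort, only the `continue` branch would be valid.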