Ci update, bugfixes #68

Matthew-Whitlock · 2023-02-23T21:30:19Z

Improved several CMake semantics:

* Variables are now prefixed with FENIX_
* mpirun tests now use CMake variables to ensure forward compatibility
* Attempts to fix issues with building on systems which have MPI headers in the default include directories, causing segfaults etc.
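As a rough illustration of the second point, a test can be registered through the launcher variables that CMake's FindMPI module provides instead of a hard-coded `mpirun` command line. This is a minimal sketch, not the actual Fenix CMake code: the project name, target name, source file, and process count are all assumptions made for the example.

```cmake
cmake_minimum_required(VERSION 3.10)      # MPIEXEC_EXECUTABLE needs CMake >= 3.10
project(mpiexec_sketch C)                 # hypothetical project name

find_package(MPI REQUIRED)                # sets the MPIEXEC_* variables and MPI::MPI_C
enable_testing()

# Hypothetical test program; the file name is an assumption for this sketch.
add_executable(sketch_test_exe sketch_test.c)
target_link_libraries(sketch_test_exe PRIVATE MPI::MPI_C)

# Launch through whatever launcher FindMPI detected (mpirun, mpiexec, srun, ...)
# rather than spelling out a launcher name and its flags by hand.
add_test(NAME sketch_test
  COMMAND ${MPIEXEC_EXECUTABLE} ${MPIEXEC_NUMPROC_FLAG} 4
          ${MPIEXEC_PREFLAGS} $<TARGET_FILE:sketch_test_exe> ${MPIEXEC_POSTFLAGS})
```

Because the launcher and its flags come from cache variables, a platform that needs extra flags (for example, oversubscription) can pass them at configure time through `MPIEXEC_PREFLAGS` without editing the test definitions.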

Merging in some bug fixes for various recovery modes (ideally a separate pr, but here we are)

Swap from Travis testing to Github actions

* Travis fixes (sandialabs#55)

  Fix some travis/testing issues.

  Travis now pulls from ULFM master branch when it needs to rebuild ULFM.
  Travis has an environment variable enabling oversubscription during the tests, instead of having that on all platforms when running make test.
  Tests that involve failure have their timeouts individually set to 1, so tests don't take 10+ seconds each w/ the default timeout of 10s.
  Simplified travis scripts (no more .travis_helpers directory).

* Revert "Travis fixes (sandialabs#55)" (sandialabs#56)

  Reverting un-reviewed PR, it was meant to be in my fork. This reverts commit a41fd3b.

* Update README.md
* Merge updates for HCLIB (sandialabs#57)
* Add ability to query which processes failed
* Add support for MPI_Test
* Add support for testing pre-failure requests
* Fix bug when ERR_PROC_FAILED/ERR_REVOKED discovered in MPI_Test
* Fix MPI_Wait w/ cancelled requests
* Add missing file to commit
* Fix bug with MPI_STATUS_IGNORE
* Fix another bug with MPI_Test
* Add no-jump recovery option
* Travis fixes (#2)

  Fix some travis/testing issues.

  Travis now pulls from ULFM master branch when it needs to rebuild ULFM.
  Travis has an environment variable enabling oversubscription during the tests, instead of having that on all platforms when running make test.
  Tests that involve failure have their timeouts individually set to 1, so tests don't take 10+ seconds each w/ the default timeout of 10s.
  Simplified travis scripts (no more .travis_helpers directory).

* First pass at removing the request store

  New function, "Fenix_test_cancelled", for checking if pre-failure requests completed or were cancelled.

  One thing to try finding a solution for: If a failure was found during an MPI_Test, that request has already been removed from MPI internals and replaced w/ MPI_REQUEST_NULL. Fenix_test_cancelled will report that this req was completed.

* Implement custom errhandler

  This includes removing the option for comm_replace - users now must provide a comm pointer to fenix_init and cannot rely on fenix to automatically replace their input comm with the resilient comm.

* Fenix comms are stack-allocated now, instead of malloced
* Cleanup redundant set_errhandler calls
* Fix data recovery bug
* Add usage instructions to all examples/tests
* Add support for MPI_Issend and MPI_Ssend (#3)

  Merge in Issend test

Co-authored-by: mwhitlo@sandia.gov <mwhitlo@sandia.gov>
Co-authored-by: sriraj <srirajpaul@gmail.com>
Co-authored-by: Keita Teranishi <knteran@sandia.gov>

… epizon-project-master

Mostly related to communicator state management

Before, unfortunately placed errors could "overwrite" the reporting info of prior errors without that info ever making it to the user. Now, we guarantee that info at least makes it to the user's first recovery callback. I.e., users are guaranteed to see a role of FENIX_ROLE_INITIAL_RANK or FENIX_ROLE_RECOVERED_RANK for a process prior to seeing FENIX_ROLE_SURVIVOR_RANK.

MPI_Datatype is vendor-dependent and we aren't allowed to assume anything about it. Right now, ompi implements it as a pointer, and we sometimes segfault on recovery.

Fix unrun test, add timeout parameter

Remove ompi version expected to fail from tests

…unimplemented feature, remove travis, cmake variable naming conventions

…_update Conflicts: CMakeLists.txt

Conflicts:
README.md
examples/01_hello_world/fenix/CMakeLists.txt
examples/02_send_recv/fenix/CMakeLists.txt
examples/05_subset_create/CMakeLists.txt
examples/06_subset_createv/CMakeLists.txt
src/fenix_data_recovery.c
src/fenix_process_recovery.c
test/failed_spares/CMakeLists.txt
test/issend/CMakeLists.txt
test/no_jump/CMakeLists.txt
test/request_cancelled/CMakeLists.txt

Matthew-Whitlock · 2023-04-17T20:02:51Z

Tests are run here, until this is merged to enable testing in this repo: https://github.com/Matthew-Whitlock/Fenix/actions/runs/4725316245/jobs/8383578049

Some tests "failed" due to timeout, but that's just a bug with ULFM where MPI_Finalize hangs occasionally.

There's also an interesting segfault that I'm not able to reproduce during one of the tests. It's not something to handle in this PR though (potentially some issue related to ULFM).

nmm0

Just a couple of minor changes/questions to answer. The FENIX_TESTS option isn't a big deal; I'll leave it up to you whether you want to use that or CMake's BUILD_TESTING.
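For readers following the discussion, the two alternatives might look roughly like the sketch below. It is illustrative only and not taken from the Fenix sources: the `test` subdirectory, the option description, and the default value of `FENIX_TESTS` are assumptions.

```cmake
# Option A: the standard CTest switch. include(CTest) defines the
# BUILD_TESTING cache option (ON by default) and enables testing when it is ON.
include(CTest)
if(BUILD_TESTING)
  add_subdirectory(test)   # assumed location of the test CMakeLists.txt
endif()

# Option B: a project-prefixed switch. The project then has to call
# enable_testing() itself. Shown commented out so the sketch configures as-is.
# option(FENIX_TESTS "Build Fenix tests" OFF)   # hypothetical description and default
# if(FENIX_TESTS)
#   enable_testing()
#   add_subdirectory(test)
# endif()
```

With Option A, users and packagers can disable tests with the familiar `-DBUILD_TESTING=OFF`; Option B keeps the switch namespaced so it cannot collide with a parent project that embeds this one.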

CMakeLists.txt

README.md

CMakeLists.txt

test/subset_internal/CMakeLists.txt

CMakeLists.txt

nmm0

Looks good to me :)

(one minor thing -- BUILD_EXAMPLES isn't a CMake idiom, though BUILD_TESTING is 🤷)

Matthew-Whitlock and others added 17 commits April 26, 2022 09:19

Update run command to newest recommended flags

8496d41

Merge branch 'master' of https://github.com/epizon-project/Fenix into…

a6904cd

… epizon-project-master

Merge changes from main repo

b17cd44

Update instructions to latest ULFM/OpenMPI recommended version

750aeac

Repair files from revert

553d1a6

Fix recovery bugs for poorly-timed failures

4465643

Mostly related to communicator state management

Fix recovery rank placement issues

eb10cca

Fix a bug related to inconsistent state during commit_barrier

efe476c

Implement Github actions for testing

d7472e2

Implement MPI system include fix

347aaa0

Improved system include fixes, removed reference to (in this branch) …

d6e33e3

…unimplemented feature, remove travis, cmake variable naming conventions

Merge branch 'ci_update' of github.com:matthew-whitlock/fenix into ci…

428f603

…_update Conflicts: CMakeLists.txt

Update install directions, another cmake variable naming convention fix

087f57e

Matthew-Whitlock marked this pull request as ready for review April 17, 2023 20:12

Matthew-Whitlock requested a review from nmm0 April 17, 2023 20:12

nmm0 requested changes May 1, 2023

Matthew-Whitlock and others added 2 commits May 8, 2023 13:37

Revert to BUILD_TESTING; make inc fix optionally transitive

81e7370

Update README.md

bdb409e

Matthew-Whitlock requested a review from nmm0 October 12, 2023 21:05

nmm0 approved these changes Dec 12, 2023

Matthew-Whitlock merged commit cf22917 into sandialabs:master Feb 12, 2024
