
Explicitly mention that the PyTorch easyblock needs updating when failing for this reason #3255

Merged

Conversation

@Flamefire (Contributor) commented Mar 13, 2024

(created using eb --new-pr)

See easybuilders/easybuild-easyconfigs#19666 (comment) for the motivation

@Flamefire
Contributor Author

@casparvl I updated the EasyBlock after the discussion at #3255 (comment)

I also noticed and fixed 2 things:

  • If we couldn't count failures for some suite, we previously only failed the step when at least one failure was actually counted
  • When a test aborted we can't count its failures, yet the previous iteration still showed the message that the EasyBlock needed updating

I moved the detection of "uncounted suites" into the parser function so that we can test it (manually).
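
As a rough illustration of that refactoring (a minimal sketch with made-up names and regexes, not the actual easyblock code), a parser that reports both counted failures and uncounted suites might look like this:

    import re

    def parse_test_log(output):
        # Hypothetical simplification: split the log into per-suite chunks and
        # return both the suites whose failures we could count and those that
        # aborted without a usable summary line.
        counted = {}    # suite name -> number of counted failures
        uncounted = []  # suites without a parseable result summary
        chunks = re.findall(r"Running (\S+) \.\.\.(.*?)(?=Running \S+ \.\.\.|\Z)",
                            output, re.DOTALL)
        for suite, body in chunks:
            summary = re.search(r"failures=(\d+)", body)
            if summary:
                counted[suite] = int(summary.group(1))
            else:
                uncounted.append(suite)  # e.g. suite crashed before producing results
        return counted, uncounted

Because the detection lives in the parser, it can be exercised manually by feeding it a saved test log and checking which suites end up in the uncounted list.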

@Flamefire force-pushed the 20240313110424_new_pr_pytorch branch from cae8f4e to 20b60c3 on August 8, 2024 07:23
@Flamefire
Contributor Author

Can this be merged? I'm working on this easyblock again and want to avoid conflicts, and would potentially like to use the improvements from here.

Flamefire and others added 8 commits February 20, 2025 16:46
Easier to search for test by name
We have 3 error cases:
1. Some suites were terminated and hence don't have proper test output we can use.
2. Unexpected output format that was not parsed correctly or was missed entirely. Some
   terminated suites may fall into this too, but the major point is that the former
   (an unexpected output format) means the EasyBlock needs an update.
3. We parsed more suites than the PyTorch summary output showed. Likely a bug in the
   EasyBlock being too greedy.

Differentiate those cases so we don't show a wrong message.
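
Purely as an illustration of that differentiation (hypothetical names, not the easyblock's actual code), the three cases could be told apart along these lines:

    def classify_mismatch(parsed_suites, summary_suites, aborted_suites):
        # Illustrative sketch only. Compare the suites we parsed against PyTorch's
        # own summary and against suites known to have aborted, and pick the
        # message that matches the actual problem.
        if aborted_suites:
            # Case 1: terminated suites. No proper output exists, so this is not
            # a parsing problem and no "EasyBlock needs updating" message is due.
            return "Did not run properly: " + ", ".join(sorted(aborted_suites))
        missing = sorted(set(summary_suites) - set(parsed_suites))
        if missing:
            # Case 2: output we failed to parse, so the EasyBlock needs updating.
            return ("Could not count failed tests for: " + ", ".join(missing) +
                    ". The test accounting in the PyTorch EasyBlock needs updating!")
        extra = sorted(set(parsed_suites) - set(summary_suites))
        if extra:
            # Case 3: we matched more suites than the summary shows; the parser
            # was probably too greedy.
            return "Parsed unexpected suites (parser too greedy?): " + ", ".join(extra)
        return None  # everything accounted for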
@Flamefire
Contributor Author

Test report by @Flamefire

Overview of tested easyconfigs (in order)

Build succeeded for 0 out of 1 (1 easyconfigs in total)
i7004 - Linux Rocky Linux 8.9 (Green Obsidian), x86_64, AMD EPYC 7702 64-Core Processor (zen2), Python 3.8.17
See https://gist.github.com/Flamefire/5d45465aabf55d4d84a543be0af94893 for a full test report.

@akesandgren (Contributor) previously approved these changes Feb 21, 2025 and left a comment:

LGTM

@akesandgren
Contributor

@Flamefire do you want to change anything else or should we merge as is?

The failed test report came in while I was reviewing :-)

It is not a given that the test output parsing code is the issue:
there is no reasonable output if the test failed to start at all, e.g.
due to syntax errors.
@Flamefire
Contributor Author

I added some further polishing, yes. It can be merged now. The test report was for an EasyConfig I crafted to run in ~40 minutes and fail some suites in different ways; it was meant to show the new output:

WARNING Found 3 individual tests that exited with an error: test_checkpointing_without_reentrant_detached_tensor_use_reentrant_False, test_excessive_thread_creation_warning, test_excessive_thread_creation_warning
Found 2 individual tests with failed assertions: test_fn_grad_linalg_det_singular_cpu_float64, test_memory_timeline

ERROR EasyBuild crashed with an error (at easybuild/base/exceptions.py:126 in __init__): Failing because not all failed tests could be determined. The test accounting in the PyTorch EasyBlock needs updating!
Missing: test_sparse, test_sparse_csr
You can check the test failures (in the log) manually and if they are harmless, use --ignore-test-failures to make the test step pass.
2 test failures, 3 test errors (out of 6437):
Failed tests (suites/files):
	profiler/test_memory_profiler (32 total tests, failures=1)
	distributed/_tensor/test_dtensor_ops (637 total tests, skipped=36, expected failures=407, unexpected successes=1)
	test_autograd (554 total tests, errors=1, skipped=24, expected failures=1)
	test_dataloader (164 total tests, errors=2, skipped=15)
	test_ops_gradients (1 failed, 1892 passed, 3055 skipped, 42 xfailed, 1 warning, 2 rerun)
Could not count failed tests for the following test suites/files:
	test_sparse (Did not run properly)
	test_sparse_csr (Did not run properly) (at easybuild/easyblocks/pytorch.py:582 in test_step)

I updated that slightly because the missing tests are those that failed to start at all. In the (current) log this shows up as:

Running test_sparse_csr ... [2025-02-21 11:02:19.117097]
Executing ['/data/Python/3.10.4-GCCcore-11.3.0/bin/python', '-bb', 'test_sparse_csr.py', '-v'] ... [2025-02-21 11:02:19.117297]

Traceback (most recent call last):
  File "/dev/shm//pytorch-v2.0.1/test/test_sparse_csr.py", line 23, in <module>
    from test_sparse import CUSPARSE_SPMM_COMPLEX128_SUPPORTED
  File "/dev/shm/pytorch-v2.0.1/test/test_sparse.py", line 29, in <module>
    raise RuntimeError("Intended failure")
RuntimeError: Intended failure
FINISHED PRINTING LOG FILE of test_sparse_csr (/dev/shm/pytorch-v2.0.1/test/test-reports/test_sparse_csr_qph3b9p4.log)

So without the last commit there is a discrepancy: it shows (in this case correctly) that the tests failed to run properly, but also that the test parsing needs updating, which is not the case here: what is needed is a patch to fix the test. Usually this specific issue is caused by us applying a patch from a previous version that breaks the current version.

So I toned it down a bit. I hope it is clear enough now :-/
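
A sketch of the kind of check the last commit implies (hypothetical helper, not the actual easyblock code): a suite that crashed at import time leaves a traceback but no per-test results, so it should be reported as "did not run properly" rather than as a sign that the test parsing needs updating:

    import re

    def suite_failed_to_start(suite_log):
        # Heuristic sketch: if the suite's log shows a Python traceback and not a
        # single per-test result line, the suite never really started, so the
        # failure calls for a patch to the test, not an EasyBlock update.
        ran_any_test = re.search(r"\.\.\. (ok|FAIL|ERROR)\b", suite_log) is not None
        crashed_early = "Traceback (most recent call last):" in suite_log
        return crashed_early and not ran_any_test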

@akesandgren (Contributor) left a comment:

Still LGTM

@akesandgren
Contributor

Going in, thanks @Flamefire!

@akesandgren merged commit 1d3d418 into easybuilders:develop on Feb 21, 2025
41 checks passed
@Flamefire deleted the 20240313110424_new_pr_pytorch branch February 21, 2025 11:45