
Explicitly mention that the PyTorch easyblock needs updating when failing for this reason #3255

Merged

Conversation

@Flamefire (Contributor) commented Mar 13, 2024

(created using eb --new-pr)

See easybuilders/easybuild-easyconfigs#19666 (comment) for the motivation

@Flamefire
Contributor Author

@casparvl I updated the EasyBlock after the discussion at #3255 (comment)

I also noticed and fixed 2 things:

  • If we couldn't count failures for some suite, we previously only failed the step when at least one failure was actually counted
  • When a test aborted we can't count its failures, yet the previous iteration still showed the message that the EasyBlock needed updating

I moved the detection of "uncounted suites" into the parser function so that we can test it (manually).
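
As a rough illustration of that refactoring (a minimal sketch with made-up names and regexes, not the actual easyblock code), a parser that reports both counted failures and uncounted suites might look like this:

    import re

    def parse_test_log(output):
        # Hypothetical simplification: split the log into per-suite chunks and
        # return both the suites whose failures we could count and those that
        # aborted without a usable summary line.
        counted = {}    # suite name -> number of counted failures
        uncounted = []  # suites without a parseable result summary
        chunks = re.findall(r"Running (\S+) \.\.\.(.*?)(?=Running \S+ \.\.\.|\Z)",
                            output, re.DOTALL)
        for suite, body in chunks:
            summary = re.search(r"failures=(\d+)", body)
            if summary:
                counted[suite] = int(summary.group(1))
            else:
                uncounted.append(suite)  # e.g. suite crashed before producing results
        return counted, uncounted

Because the detection lives in the parser, it can be exercised manually by feeding it a saved test log and checking which suites end up in the uncounted list.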

@Flamefire force-pushed the 20240313110424_new_pr_pytorch branch from cae8f4e to 20b60c3 on August 8, 2024 07:23
@Flamefire
Contributor Author

Can this be merged? I'm working on this easyblock again and want to avoid conflicts, and would potentially like to use the improvements from here.

Flamefire and others added 8 commits February 20, 2025 16:46
Easier to search for test by name
We have 3 error cases:
1. Some suites were terminated and hence don't have proper test output we can use.
2. Unexpected output format that was not parsed correctly or was missed entirely. Some
   terminated suites may fall into this too, but the major point is that the former
   (an unexpected output format) means the EasyBlock needs an update.
3. We parsed more suites than the PyTorch summary output showed. Likely a bug in the
   EasyBlock being too greedy.

Differentiate those cases so we don't show a wrong message.
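
Purely as an illustration of that differentiation (hypothetical names, not the easyblock's actual code), the three cases could be told apart along these lines:

    def classify_mismatch(parsed_suites, summary_suites, aborted_suites):
        # Illustrative sketch only. Compare the suites we parsed against PyTorch's
        # own summary and against suites known to have aborted, and pick the
        # message that matches the actual problem.
        if aborted_suites:
            # Case 1: terminated suites. No proper output exists, so this is not
            # a parsing problem and no "EasyBlock needs updating" message is due.
            return "Did not run properly: " + ", ".join(sorted(aborted_suites))
        missing = sorted(set(summary_suites) - set(parsed_suites))
        if missing:
            # Case 2: output we failed to parse, so the EasyBlock needs updating.
            return ("Could not count failed tests for: " + ", ".join(missing) +
                    ". The test accounting in the PyTorch EasyBlock needs updating!")
        extra = sorted(set(parsed_suites) - set(summary_suites))
        if extra:
            # Case 3: we matched more suites than the summary shows; the parser
            # was probably too greedy.
            return "Parsed unexpected suites (parser too greedy?): " + ", ".join(extra)
        return None  # everything accounted for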
@Flamefire
Contributor Author

Test report by @Flamefire

Overview of tested easyconfigs (in order)

Build succeeded for 0 out of 1 (1 easyconfigs in total)
i7004 - Linux Rocky Linux 8.9 (Green Obsidian), x86_64, AMD EPYC 7702 64-Core Processor (zen2), Python 3.8.17
See https://gist.github.com/Flamefire/5d45465aabf55d4d84a543be0af94893 for a full test report.

@akesandgren (Contributor) previously approved these changes Feb 21, 2025 and left a comment:

LGTM

@akesandgren
Contributor

@Flamefire do you want to change anything else or should we merge as is?

The failed test report came in while I was reviewing :-)

It is not a given that the test output parsing code is the issue:
there is no reasonable output if the test failed to start at all, e.g.
due to syntax errors.
@Flamefire
Contributor Author

I added some further polishing, yes. It can be merged now. The test report was for an EasyConfig I crafted to run in ~40 minutes and fail some suites in different ways; it was meant to show the new output:

WARNING Found 3 individual tests that exited with an error: test_checkpointing_without_reentrant_detached_tensor_use_reentrant_False, test_excessive_thread_creation_warning, test_excessive_thread_creation_warning
Found 2 individual tests with failed assertions: test_fn_grad_linalg_det_singular_cpu_float64, test_memory_timeline

ERROR EasyBuild crashed with an error (at easybuild/base/exceptions.py:126 in __init__): Failing because not all failed tests could be determined. The test accounting in the PyTorch EasyBlock needs updating!
Missing: test_sparse, test_sparse_csr
You can check the test failures (in the log) manually and if they are harmless, use --ignore-test-failures to make the test step pass.
2 test failures, 3 test errors (out of 6437):
Failed tests (suites/files):
	profiler/test_memory_profiler (32 total tests, failures=1)
	distributed/_tensor/test_dtensor_ops (637 total tests, skipped=36, expected failures=407, unexpected successes=1)
	test_autograd (554 total tests, errors=1, skipped=24, expected failures=1)
	test_dataloader (164 total tests, errors=2, skipped=15)
	test_ops_gradients (1 failed, 1892 passed, 3055 skipped, 42 xfailed, 1 warning, 2 rerun)
Could not count failed tests for the following test suites/files:
	test_sparse (Did not run properly)
	test_sparse_csr (Did not run properly) (at easybuild/easyblocks/pytorch.py:582 in test_step)

I updated that slightly because the missing tests are those that failed to start at all. In the (current) log this shows up as:

Running test_sparse_csr ... [2025-02-21 11:02:19.117097]
Executing ['/data/Python/3.10.4-GCCcore-11.3.0/bin/python', '-bb', 'test_sparse_csr.py', '-v'] ... [2025-02-21 11:02:19.117297]

Traceback (most recent call last):
  File "/dev/shm//pytorch-v2.0.1/test/test_sparse_csr.py", line 23, in <module>
    from test_sparse import CUSPARSE_SPMM_COMPLEX128_SUPPORTED
  File "/dev/shm/pytorch-v2.0.1/test/test_sparse.py", line 29, in <module>
    raise RuntimeError("Intended failure")
RuntimeError: Intended failure
FINISHED PRINTING LOG FILE of test_sparse_csr (/dev/shm/pytorch-v2.0.1/test/test-reports/test_sparse_csr_qph3b9p4.log)

So without the last commit there is a discrepancy: it shows (in this case correctly) that the tests failed to run properly, but also that the test parsing needs updating, which is not the case here: what is needed is a patch to fix the test. Usually this specific issue is caused by us applying a patch from a previous version that breaks the current version.

So I toned it down a bit. I hope it is clear enough now :-/
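
A sketch of the kind of check the last commit implies (hypothetical helper, not the actual easyblock code): a suite that crashed at import time leaves a traceback but no per-test results, so it should be reported as "did not run properly" rather than as a sign that the test parsing needs updating:

    import re

    def suite_failed_to_start(suite_log):
        # Heuristic sketch: if the suite's log shows a Python traceback and not a
        # single per-test result line, the suite never really started, so the
        # failure calls for a patch to the test, not an EasyBlock update.
        ran_any_test = re.search(r"\.\.\. (ok|FAIL|ERROR)\b", suite_log) is not None
        crashed_early = "Traceback (most recent call last):" in suite_log
        return crashed_early and not ran_any_test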

@akesandgren (Contributor) left a comment:

Still LGTM

@akesandgren
Contributor

Going in, thanks @Flamefire!

@akesandgren merged commit 1d3d418 into easybuilders:develop on Feb 21, 2025
41 checks passed
@Flamefire deleted the 20240313110424_new_pr_pytorch branch February 21, 2025 11:45