DAOS-16265 test: Fix erasurecode/rebuild_fio.py out of space #15020

Merged: 11 commits into master from pahender/DAOS-16265 on Oct 17, 2024

Contributor

@phender phender commented Aug 27, 2024

Prevent accumulating large server log files caused by temporarily
enabling the DEBUG log mask while creating or destroying pools.

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: EcodFioRebuild
Skip-func-hw-test-large-md-on-ssd: false

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

The erasurecode/rebuild_fio.py test runs out of space in self.test_dir
due to the same path being used for the control metadata path in MD on
SSD mode.  The test log file is also quite large with 24 test variants.

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: EcodFioRebuild
Skip-func-hw-test-large-md-on-ssd: false

Required-githooks: true

Signed-off-by: Phil Henderson <phillip.henderson@intel.com>

github-actions bot commented Aug 27, 2024

Ticket title is '[12-24]-./erasurecode/rebuild_fio.py:EcodFioRebuild.test_ec_online_rebuild_fio tests fail due to daos_server startup problem.'
Status is 'Reopened'
Labels: 'ci_master_weekly,md_on_ssd,weekly_test,scrubbed_2.8'
https://daosio.atlassian.net/browse/DAOS-16265

@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15020/1/execution/node/962/log

@daosbuild1
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15020/2/execution/node/946/log

@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15020/2/execution/node/962/log

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: EcodFioRebuild

Required-githooks: true

Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
@phender
Contributor Author

phender commented Oct 11, 2024

Verified the change by reducing the threshold to 3% in https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15020/7/artifact/Functional%20Hardware%20Large/erasurecode/rebuild_fio.py/job.log:

2024-10-10 23:52:13,524 test             L0773 INFO | ----------------------------------------------------------------------------------------------------
2024-10-10 23:52:13,524 test             L0776 DEBUG| Common test directory (/var/tmp/daos_testing) contents (check > 3%):
2024-10-10 23:52:13,525 run_utils        L0470 DEBUG| Running on wolf-[304,318-324] with a 120 second timeout: df -h /var/tmp/daos_testing
2024-10-10 23:52:13,751 run_utils        L0336 DEBUG|   wolf-324 (rc=0):
2024-10-10 23:52:13,751 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-10 23:52:13,751 run_utils        L0341 DEBUG|     /dev/sda7        28G  272K   26G   1% /var/tmp
2024-10-10 23:52:13,751 run_utils        L0336 DEBUG|   wolf-304 (rc=0):
2024-10-10 23:52:13,752 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-10 23:52:13,752 run_utils        L0341 DEBUG|     /dev/sda7        28G  8.3M   26G   1% /var/tmp
2024-10-10 23:52:13,752 run_utils        L0336 DEBUG|   wolf-319 (rc=0):
2024-10-10 23:52:13,752 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-10 23:52:13,752 run_utils        L0341 DEBUG|     /dev/sda7        28G  460M   26G   2% /var/tmp
2024-10-10 23:52:13,752 run_utils        L0336 DEBUG|   wolf-318 (rc=0):
2024-10-10 23:52:13,752 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-10 23:52:13,752 run_utils        L0341 DEBUG|     /dev/sda7        28G  1.3G   25G   5% /var/tmp
2024-10-10 23:52:13,752 run_utils        L0336 DEBUG|   wolf-[322-323] (rc=0):
2024-10-10 23:52:13,752 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-10 23:52:13,752 run_utils        L0341 DEBUG|     /dev/sda7        28G   68K   26G   1% /var/tmp
2024-10-10 23:52:13,752 run_utils        L0336 DEBUG|   wolf-320 (rc=0):
2024-10-10 23:52:13,752 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-10 23:52:13,752 run_utils        L0341 DEBUG|     /dev/sda7        28G  763M   26G   3% /var/tmp
2024-10-10 23:52:13,752 run_utils        L0336 DEBUG|   wolf-321 (rc=0):
2024-10-10 23:52:13,752 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-10 23:52:13,752 run_utils        L0341 DEBUG|     /dev/sda7        28G  707M   26G   3% /var/tmp
2024-10-10 23:52:13,753 run_utils        L0470 DEBUG| Running on wolf-318 with a 120 second timeout: du -sh /var/tmp/daos_testing/*
2024-10-10 23:52:13,940 run_utils        L0336 DEBUG|   wolf-318 (rc=0):
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     4.0K	/var/tmp/daos_testing/cart_logs
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     8.0K	/var/tmp/daos_testing/configs
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     32K	/var/tmp/daos_testing/daosCA
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     4.0K	/var/tmp/daos_testing/daos_configs
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     4.0K	/var/tmp/daos_testing/daos_dumps
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     4.0K	/var/tmp/daos_testing/daos_logs
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     4.0K	/var/tmp/daos_testing/stacktraces
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     4.0K	/var/tmp/daos_testing/test_ec_online_rebuild_fio
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     628K	/var/tmp/daos_testing/test_ec_online_rebuild_fio_daos_control.log
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     148K	/var/tmp/daos_testing/test_ec_online_rebuild_fio_daos_server_0.log.106004
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     138M	/var/tmp/daos_testing/test_ec_online_rebuild_fio_daos_server_0.log.112564
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     80K	/var/tmp/daos_testing/test_ec_online_rebuild_fio_daos_server_0.log.119108
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     84K	/var/tmp/daos_testing/test_ec_online_rebuild_fio_daos_server_0.log.125555
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     76K	/var/tmp/daos_testing/test_ec_online_rebuild_fio_daos_server_0.log.131884
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     640M	/var/tmp/daos_testing/test_ec_online_rebuild_fio_daos_server_0.log.138421
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     80K	/var/tmp/daos_testing/test_ec_online_rebuild_fio_daos_server_0.log.92584
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     88K	/var/tmp/daos_testing/test_ec_online_rebuild_fio_daos_server_0.log.99085
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     96K	/var/tmp/daos_testing/test_ec_online_rebuild_fio_daos_server_1.log.105816
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     148M	/var/tmp/daos_testing/test_ec_online_rebuild_fio_daos_server_1.log.112373
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     68K	/var/tmp/daos_testing/test_ec_online_rebuild_fio_daos_server_1.log.118919
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     56K	/var/tmp/daos_testing/test_ec_online_rebuild_fio_daos_server_1.log.125366
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     84K	/var/tmp/daos_testing/test_ec_online_rebuild_fio_daos_server_1.log.131697
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     325M	/var/tmp/daos_testing/test_ec_online_rebuild_fio_daos_server_1.log.138233
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     60K	/var/tmp/daos_testing/test_ec_online_rebuild_fio_daos_server_1.log.92395
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     120K	/var/tmp/daos_testing/test_ec_online_rebuild_fio_daos_server_1.log.98898
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     292K	/var/tmp/daos_testing/test_ec_online_rebuild_fio_daos_server_helper.log
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     4.0K	/var/tmp/daos_testing/user
2024-10-10 23:52:13,941 run_utils        L0341 DEBUG|     4.0K	/var/tmp/daos_testing/valgrind_logs
2024-10-10 23:52:13,941 test             L0788 INFO | ----------------------------------------------------------------------------------------------------
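
To pick out the largest entries from a `du -sh` listing like the one above, the human-readable sizes have to be converted before sorting. The following is a minimal sketch under stated assumptions, not code from this PR: the helper names `to_bytes` and `largest_entries` are hypothetical, and the input is assumed to be the raw `du -sh` text with one `<size><whitespace><path>` entry per line.

```python
# Hypothetical helpers (not part of this PR): rank `du -sh` entries by size.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(size):
    """Convert a du-style size such as '4.0K' or '138M' to a byte count."""
    if size[-1] in UNITS:
        return float(size[:-1]) * UNITS[size[-1]]
    return float(size)  # plain number with no unit suffix, e.g. '0'


def largest_entries(du_output, count=3):
    """Return the `count` largest (size, path) pairs from `du -sh` output."""
    entries = []
    for line in du_output.splitlines():
        if not line.strip():
            continue
        size, path = line.split(None, 1)
        entries.append((to_bytes(size), size, path.strip()))
    entries.sort(key=lambda entry: entry[0], reverse=True)
    return [(size, path) for _, size, path in entries[:count]]


if __name__ == "__main__":
    # Sizes taken from the listing above; paths shortened for readability.
    sample = "\n".join([
        "628K\tdaos_control.log",
        "138M\tdaos_server_0.log.112564",
        "640M\tdaos_server_0.log.138421",
        "148M\tdaos_server_1.log.112373",
        "325M\tdaos_server_1.log.138233",
    ])
    for size, path in largest_entries(sample):
        print(size, path)
```

With the sample above, the three largest entries are the 640M, 325M, and 148M server logs, which matches the observation that the server log files dominate the space used on wolf-318.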

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: EcodFioRebuild
Skip-func-hw-test-large-md-on-ssd: false

Required-githooks: true

Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
@phender phender marked this pull request as ready for review October 11, 2024 14:59
@phender phender requested review from a team as code owners October 11, 2024 14:59
Contributor

@daltonbohning daltonbohning left a comment

This just looks like debugging. How is erasurecode/rebuild_fio.py being fixed?

@phender
Contributor Author

phender commented Oct 14, 2024

This just looks like debugging. How is erasurecode/rebuild_fio.py being fixed?

This optimizes code we already implemented to debug an erasurecode/rebuild_fio.py issue in which the test ran out of space in the testing directory. The optimization reports which files are consuming space only on the nodes that exceed the threshold, instead of on all hosts. The original problem is no longer being seen. In fact, even in the most recent https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15020/8/artifact/Functional%20Hardware%20Large%20MD%20on%20SSD/erasurecode/rebuild_fio.py/job.log run, the highest use percentage is 7% by the 24th test variant.

It also adds an option to adjust the threshold via the test yaml (or extra test yaml). One additional option we could enable is to completely bypass the check if the max_test_dir_usage_check is set to 0.
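
The threshold behavior described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function names `parse_use_percent` and `hosts_over_threshold` and the input shape (a mapping of host name to `df -h <path>` output) are assumptions, and treating a threshold of 0 as "skip the check" models the proposed bypass rather than behavior confirmed here. The sketch also assumes the mount point contains no spaces.

```python
# Hypothetical sketch (not the PR's implementation): select hosts whose
# test directory usage exceeds a configurable percentage threshold.


def parse_use_percent(df_output):
    """Return the Use% value from the data row of `df -h <path>` output."""
    lines = [line for line in df_output.splitlines() if line.strip()]
    # The last non-empty line is the data row; Use% is the second-to-last
    # column (assumes the mount point has no embedded spaces).
    fields = lines[-1].split()
    return int(fields[-2].rstrip("%"))


def hosts_over_threshold(df_by_host, threshold):
    """Return the sorted hosts whose usage is strictly above `threshold`.

    A threshold of 0 (or less) is treated as 'check disabled', which is an
    assumption modeling the bypass proposed in the comment above.
    """
    if threshold <= 0:
        return []
    return sorted(
        host for host, output in df_by_host.items()
        if parse_use_percent(output) > threshold
    )


if __name__ == "__main__":
    header = "Filesystem      Size  Used Avail Use% Mounted on\n"
    df_by_host = {
        "wolf-318": header + "/dev/sda7        28G  1.3G   25G   5% /var/tmp",
        "wolf-320": header + "/dev/sda7        28G  763M   26G   3% /var/tmp",
        "wolf-324": header + "/dev/sda7        28G  272K   26G   1% /var/tmp",
    }
    print(hosts_over_threshold(df_by_host, 3))  # prints ['wolf-318']
```

The strict greater-than comparison mirrors the "check > 3%" wording in the earlier job log, where only wolf-318 (5%) received the detailed `du` listing while the hosts at exactly 3% did not.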

daltonbohning previously approved these changes Oct 14, 2024
Prevent accumulating large server log files caused by temporarily
enabling the DEBUG log mask while creating or destroying pools.

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: EcodFioRebuild EcodOnlineMultFail
Skip-func-hw-test-large-md-on-ssd: false

Required-githooks: true

Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
@phender
Contributor Author

phender commented Oct 17, 2024

A fix is now included that stops enabling the DEBUG log mask when creating or destroying pools.
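
One way to make a temporary log mask change optional is a small context manager that only touches the mask when asked. This is a generic sketch under loud assumptions, not the change made in this PR: the name `optional_debug_mask`, the `set_mask` callback, and the `"INFO"` value restored afterwards are all hypothetical stand-ins for whatever the test framework actually uses.

```python
from contextlib import contextmanager


@contextmanager
def optional_debug_mask(set_mask, enabled, default_mask="INFO"):
    """Raise the log mask to DEBUG around a block only when `enabled`.

    `set_mask` is a hypothetical callable that applies a log mask, and
    `default_mask` is an assumed value to restore afterwards.
    """
    if enabled:
        set_mask("DEBUG")
    try:
        yield
    finally:
        # Restore the mask even if the wrapped operation raises.
        if enabled:
            set_mask(default_mask)


if __name__ == "__main__":
    applied = []
    # With the toggle off, the wrapped operation runs without any mask
    # changes, so no extra DEBUG output accumulates in the server logs.
    with optional_debug_mask(applied.append, enabled=False):
        pass  # e.g. create or destroy a pool here
    print(applied)  # prints []
```

Keeping the toggle as an explicit argument lets a single long-running test with many variants opt out of DEBUG logging while other tests keep the extra detail.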

@phender phender requested a review from a team October 17, 2024 14:47
@phender phender added the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label Oct 17, 2024
@daltonbohning daltonbohning merged commit 8f70ea0 into master Oct 17, 2024
46 of 48 checks passed
@daltonbohning daltonbohning deleted the pahender/DAOS-16265 branch October 17, 2024 20:47
phender added a commit that referenced this pull request Oct 17, 2024
Prevent accumulating large server log files caused by temporarily
enabling the DEBUG log mask while creating or destroying pools.

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: EcodFioRebuild EcodOnlineMultFail
Skip-func-hw-test-large-md-on-ssd: false

Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
phender added a commit that referenced this pull request Oct 21, 2024
…#15340)

Prevent accumulating large server log files caused by temporarily
enabling the DEBUG log mask while creating or destroying pools.

Signed-off-by: Phil Henderson <phillip.henderson@intel.com>