mjmac/DAOS 16787 google 2.6 #15498

Closed
wants to merge 370 commits into from

Conversation

mjmac
Contributor

@mjmac mjmac commented Nov 13, 2024

jolivier23 and others added 30 commits July 22, 2024 12:13
We added two metrics around the quota code but they
never made it into master branch.

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
- Change cart_ctl to no longer fail if pre-ping fails in daos env
- Set individual cmd pings to be 3 seconds
- Change cart_ctl invocation to use 'no sync' option, avoiding pinging
all ranks, including possibly dead ones

Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>
wenbaoxu@163.com found that pool_svc_rfcheck_ult may enter an infinite
loop due to -DER_NOTLEADERs when the PS leader is stepping down:

    src/rdb/rdb_raft.c:3146 rdb_raft_wait_applied() 564e1a52[1]: waiting
      for entry 0 to be applied
    src/container/srv_container.c:5050 ds_cont_rdb_iterate() 564e1a52
      container iter: rc -2008
    src/pool/srv_pool.c:6405 pool_svc_rfcheck_ult() 564e1a52 check rf
      with -2008 and retry
    [...]

This patch adds a check for whether the "sched" is canceled in
pool_svc_rfcheck_ult before sleeping and retrying, and fixes a nonsense
dss_sleep(0) call.
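
A minimal sketch of the retry pattern described above, written in Python rather than the C used by the engine. The names `rf_check_loop`, `check_once`, and `canceled` are hypothetical, and -2008 is simply the return code shown in the log excerpt; this illustrates checking for cancellation before sleeping and retrying, not the actual pool_svc_rfcheck_ult code.

```python
import threading

NOT_LEADER = -2008  # return code taken from the log excerpt above


def rf_check_loop(check_once, canceled, retry_delay=0.01):
    """Retry check_once() while it reports NOT_LEADER.

    Returns the first other return code, or None when `canceled` is set
    while the check is still failing with NOT_LEADER.
    """
    while True:
        rc = check_once()
        if rc != NOT_LEADER:
            return rc
        # Check for cancellation before sleeping and retrying, so a
        # leader that is stepping down cannot loop forever.
        if canceled.is_set():
            return None
        # Sleep for a non-zero delay; wakes early if canceled meanwhile.
        canceled.wait(retry_delay)
```

Without the cancellation check, a check that keeps returning the not-leader code would never leave the loop.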

Signed-off-by: Li Wei <wei.g.li@intel.com>
1. Send DTX batched commit RPCs step by step
Currently, each DTX batched commit operation handles at most 512 DTX
entries, which may generate DTX commit RPCs to thousands of DAOS
targets. Instead of sending all of the batched RPCs at once, we now
send them step by step. After each step, the logic yields and waits
for the replies before sending the next batch of RPCs. This avoids
holding too many system resources for a relatively long time. It also
helps reduce the peak network load of the whole system and the
pressure on the related targets.
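
The step-by-step sending can be sketched as follows. This is a Python illustration under stated assumptions: `commit_in_steps`, `send_and_wait`, and the step size are hypothetical names and values, not the DTX API.

```python
def commit_in_steps(rpcs, step_size, send_and_wait):
    """Send rpcs in groups of step_size, waiting for each group's replies.

    send_and_wait(step) is assumed to block (or yield) until every RPC in
    the step has been answered, and to return the list of replies.
    """
    replies = []
    for start in range(0, len(rpcs), step_size):
        step = rpcs[start:start + step_size]
        # Only one step is in flight at a time, which bounds the resources
        # held and the peak network load.
        replies.extend(send_and_wait(step))
    return replies
```

At most `step_size` RPCs are outstanding at any moment, at the cost of extra round trips for large batches.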

2. Cleanup stale DTX based on global RPC timeout
Originally, the DTX cleanup logic was triggered when the age of some
stale DTX exceeded the fixed threshold DTX_CLEANUP_THD_AGE_UP (90 sec),
which may be smaller than the global default RPC timeout. As a result,
the DTX refresh RPC for the cleanup logic could be sent too early,
before the related modification RPC(s) had timed out, which increases
network load unnecessarily.

We now adjust the DTX cleanup threshold based on the global default RPC
timeout value, and give the related DTX leader some time after the
default RPC timeout to commit or abort the DTX. If the DTX is still
prepared after that, DTX cleanup is triggered to handle potential stale
DTX entries.
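
As a rough illustration of the threshold rule, the sketch below derives the cleanup age from the RPC timeout plus a grace period for the leader. The function names and the numeric values in the usage note are assumptions for illustration only; the patch text does not state the actual grace period.

```python
def dtx_cleanup_threshold(rpc_timeout_sec, leader_grace_sec):
    """Age (seconds) after which a still-prepared DTX is treated as stale."""
    return rpc_timeout_sec + leader_grace_sec


def needs_cleanup(dtx_age_sec, is_prepared, rpc_timeout_sec, leader_grace_sec):
    """Trigger cleanup only for DTX entries still prepared past the threshold."""
    threshold = dtx_cleanup_threshold(rpc_timeout_sec, leader_grace_sec)
    return is_prepared and dtx_age_sec > threshold
```

With a hypothetical 120 sec RPC timeout and 30 sec grace period, a prepared DTX aged 100 sec is left alone, whereas the old fixed 90 sec threshold would already have triggered a refresh RPC.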

3. Reorg DTX CoS logic
Reduce the RPCs caused by potentially repeated DTX commits. Use clearer
names for the DTX CoS API.

Signed-off-by: Fan Yong <fan.yong@intel.com>
A bug was limiting the number of ranks shown to the number of MS
replicas instead.

- Show all rank fabric URIs.
- Added MS ranks list to sys info.
- Add MS ranks to daos system query output.

Signed-off-by: Kris Jacque <kris.jacque@intel.com>
Co-authored-by: Cedric Koch-Hofer <94527853+knard-intel@users.noreply.github.com>
add MPICH and IMPI infos on how to set daos: prefix

Signed-off-by: Michael Hennecke <michael.hennecke@intel.com>
Makes it a little easier to query the layout

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Add support for a new daos fs chmod command to adjust the mode of a file.

Signed-off-by: Colin Howes <chowes@google.com>
Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Co-authored-by: Colin Howes <chowes@google.com>
This patch addresses a corner case where a leadership change may
occur while a pool is in the Destroying state. During checkPools,
if the pool is in a Destroying state, the MS will attempt to
destroy the pool. The top level PoolDestroy method attempted to
wait for leadership step-up to finish, but was needed in order for
step-up to complete.

It is okay to skip the leadership check for PoolDestroy in this
case. Leadership was already checked at the beginning of checkPools.

Signed-off-by: Kris Jacque <kris.jacque@intel.com>
Allow the control client to report a version other than the
static value embedded at build time. Enables some nonstandard
use cases where Control API users will take responsibility
for version interoperability without code changes.

Change-Id: I0ce6ddbf2c742ce9c8dab9e21e37cd5ea8c5f5b3
Signed-off-by: Michael MacDonald <mjmac@google.com>
update pylint to 3.2.6

Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>
merge yamllint and clang-format into linting workflow so all lint checks
are grouped together.

Make yaml-lint required but clang-format optional until stable.

Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>
use math.inf to disable threshold checks in CI so the test can run
as a smoke test
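
A small sketch of why math.inf disables a threshold check. It assumes the threshold is an upper bound on the measured value; `check_threshold` is a hypothetical helper, not the test's actual code.

```python
import math


def check_threshold(measured, threshold):
    """Return True when the measured value does not exceed the threshold."""
    return measured <= threshold


# With an infinite threshold, every finite measurement passes, so the
# test still exercises the code path but never fails on the comparison.
ci_threshold = math.inf
```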

Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>
Get the targets and ranks from the config instead of hardcoding.
Also use self.random instead of direct random.

Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>
Try to build with the SHA under test if possible in the
dfuse/daos_build.py test.

Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>
)

remove DataMoverTestBase.posix_local_test_paths in favor of local
references

Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>
Perform basic system health checks from the client
perspective. Checks the following:

  * Client/Server versions
  * Key library versions and paths
  * Connected system information
  * Pool status for all pools to which the user
    has access
  * Container status for all containers in the
    checked pools

Change-Id: I9154ee7f3632996e0e67ad6f320874e1df2e0d23
Signed-off-by: Michael MacDonald <mjmac@google.com>
Config generate should not output auto calculated values
update ftest to not expect scm_size in config generate yaml output

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Ignore NotReplica errors when checking if enabled.

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
For VOS command test, add a visible iterator option so
one can manually verify that VOS iterates correctly for
that case.

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Added some more Go runtime suppressions. I've added counterparts for
all functions already on the list, as well as the false positives that
have been recently spotted.

Signed-off-by: Kris Jacque <kris.jacque@intel.com>
#14736)

After SysXS device is set to faulty, the device state transition may take a
while until its state becomes EVICTED. Wait for 10 sec before querying the
device state. If the state isn't EVICTED, query again.

Also set only one SysXS device to faulty, because there is no point in
setting a second SysXS device to faulty (based on developer feedback).

Check whether any of the engines is down at the end of the test

Signed-off-by: Makito Kano <makito.kano@intel.com>
…#14819)

For EC object rebuild, some extents may not exist on some shards. In
this case, create the VOS container even when no record needs to be
rebuilt, so that subsequent I/O does not fail to find the container at
obj_ioc_init() -> cont_child_lookup().
Another case is cont_snap_update_one(), which now creates the VOS
container if it does not exist.

Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>
Update the current run_local() command to return an object similar to
run_remote() to allow them to be used interchangeably.

increase verify_perms.py timeout.

Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Running `(daos|dmg) pool query-targets` with just a rank
argument should query all targets on that rank.

Signed-off-by: Michael MacDonald <mjmac@google.com>
When the LRU cache is performing eviction, new lookups should fail.
Currently, this logic is implemented on the caller’s side. Let's
move this logic to the DAOS LRU side to return DER_SHUTDOWN if LRU
is evicting and remove incorrect assertion.
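
The idea of failing lookups inside the cache itself can be sketched like this. The class and method names are hypothetical, and an exception stands in for the DER_SHUTDOWN return code; this is not the DAOS LRU implementation.

```python
class CacheShutdown(Exception):
    """Stands in for the DER_SHUTDOWN return code in this sketch."""


class LruCache:
    def __init__(self):
        self._items = {}
        self._evicting = False

    def insert(self, key, value):
        self._items[key] = value

    def begin_eviction(self):
        # From this point on, the cache itself rejects new lookups, so
        # callers no longer need their own "is it evicting?" checks.
        self._evicting = True
        self._items.clear()

    def lookup(self, key):
        if self._evicting:
            raise CacheShutdown("cache is evicting")
        return self._items.get(key)
```

Keeping the check inside `lookup` means every caller gets the same behavior without duplicating the logic.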

Signed-off-by: Wang Shilong <shilong.wang@intel.com>
Signed-off-by: Lei Huang <lei.huang@intel.com>
Correct protobuf field names to be consistent within the pool.proto
file and remove meta-blob size references in MD-on-SSD phase-1 code.

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
- Instead of using protobuf map, represent the map as an array where the index == NUMA node ID.
- Always include domain in NUMA fabric map. Previously domain was missing when it was the same as the interface
name.
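
A sketch of the two changes above: representing the NUMA map as an array indexed by NUMA node ID, and always filling in the domain. The helper names, the dictionary layout, and the interface names in the usage note are assumptions for illustration; the real messages are protobuf types.

```python
def with_domain(interface, domain=None):
    """Always include a domain; fall back to the interface name."""
    return {"interface": interface, "domain": domain or interface}


def numa_map_to_array(numa_map):
    """Convert {numa_node_id: [entries]} to a list where index == node ID.

    NUMA nodes without any entries get an empty list.
    """
    if not numa_map:
        return []
    result = [[] for _ in range(max(numa_map) + 1)]
    for node_id, entries in numa_map.items():
        result[node_id] = list(entries)
    return result
```

For example, `numa_map_to_array({0: [with_domain("eth0")], 2: [with_domain("ib0", "mlx5_0")]})` yields a three-element list with an empty entry at index 1.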

Signed-off-by: Kris Jacque <kris.jacque@intel.com>
When the change was made to allow partial overwrite
for rebuild, it broke delete such that delete would
remove the newest extent rather than the exact one
we requested.

Also, don't ignore errors when processing removals

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
NiuYawei and others added 27 commits November 4, 2024 21:02
* DAOS-13701: Memory bucket allocator API definition (#13152)

- New umem macros are exported to do the allocation within
  memory bucket. umem internally now calls the modified backend
  allocator routines with memory bucket id passed as argument.
- umem_get_mb_evictable() and dav_get_zone_evictable() are
  added to support allocator returning preferred zone to be
  used as evictable memory bucket for current allocations. Right
  now these routines always return zero.
- The dav heap runtime is cleaned up to make provision for
  memory bucket implementation.

* DAOS-13703 umem: umem cache APIs for phase II (#13138)

Four sets of umem cache APIs will be exported for md-on-ssd phase II:

1. Cache initialization & finalization
   - umem_cache_alloc()
   - umem_cache_free()

2. Cache map, load and pin
   - umem_cache_map();
   - umem_cache_load();
   - umem_cache_pin();
   - umem_cache_unpin();

3. Offset and memory address converting
   - umem_cache_off2ptr();
   - umem_cache_ptr2off();
  
4. Misc
   - umem_cache_commit();
   - umem_cache_reserve();

* DAOS-14491: Retain support for phase-1 DAV heap (#13158)

The phase-2 DAV allocator is placed under the subdirectory
src/common/dav_v2. This allocator is built as a standalone shared
library and linked to the libdaos_common_pmem library. 
The umem will now support one more mode DAOS_MD_BMEM_V2. Setting
this mode in umem instance will result in using phase-2 DAV allocator
interfaces.
  
* DAOS-15681 bio: store scm_sz in SMD (#14330)

In md-on-ssd phase 2, the scm_sz (VOS file size) could be smaller
than the meta_sz (meta blob size), then we need to store an extra
scm_sz in SMD, so that on engine start, this scm_sz could be
retrieved from SMD for VOS file re-creation.

To make the SMD compatible with pmem & md-on-ssd phase 1, a new
table named "meta_pool_ex" is introduced for storing scm_sz.

* DAOS-14422 control: Update pool create UX for MD-on-SSD phase2 (#14740)

Show MD-on-SSD specific output on pool create and add new syntax to
specify ratio between SSD capacity reserved for MD in new DAOS pool
and the (static) size of memory reserved for MD in the form of VOS
index files (previously held on SCM but now in tmpfs on ramdisk).
Memory-file size is now printed when creating a pool in MD-on-SSD
mode.

The new --{meta,data}-size params can be specified in decimal or
binary units e.g. GB or GiB and refer to per-rank allocations. These
manual size parameters are only for advanced use cases and in most
situations the --size (X%|XTB|XTiB) syntax is recommended when
creating a pool. --meta-size param is bytes to use for metadata on
SSD and --data-size is for data on SSD (similar to --nvme-size).

The new --mem-ratio param is specified as a percentage with up to two
decimal places precision. This defines the proportion of the metadata
capacity reserved on SSD (i.e. --meta-size) that will be used when
allocating the VOS-index (one blob and one memory file per target).
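
As a worked example of the ratio, the sketch below computes the memory reserved for metadata from a metadata size and a mem-ratio percentage. It is a simplification under stated assumptions: `mem_file_bytes` is a hypothetical helper, and it ignores the per-target split and any alignment or rounding the control plane applies.

```python
def mem_file_bytes(meta_size_bytes, mem_ratio_pct):
    """Memory (bytes) for metadata, given meta size and mem-ratio percent.

    The percentage is handled with two decimal places of precision by
    converting it to integer hundredths of a percent.
    """
    hundredths = round(mem_ratio_pct * 100)
    if not 0 < hundredths <= 10000:
        raise ValueError("mem-ratio must be in the range (0, 100]")
    return meta_size_bytes * hundredths // 10000


GIB = 2 ** 30
quarter = mem_file_bytes(64 * GIB, 25.0)  # 16 GiB of memory for 64 GiB of meta
```

A ratio of 100 means the memory file is as large as the metadata on SSD; smaller ratios mean only part of the metadata fits in memory at once.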

Enabling MD-on-SSD phase2 pool creation requires the environment
variable DAOS_MD_ON_SSD_MODE=3 to be set in the server config file.

* DAOS-14317 vos: initial changes for the phase2 object pre-load (#15001)

- Introduced new durable format 'vos_obj_p2_df' for the md-on-ssd phase2
  object, at most 4 evict-able bucket IDs could be stored.

- Changed vos_obj_hold() & vos_obj_release() to pin or unpin object
  respectively.

- Changed the private data of VOS dkey/akey/value trees from 'vos_pool' to
  'vos_object', the private data will be used for allocating/reserving from
  the evict-able bucket.

- Move the vos_obj_hold() call from vos_update_end() to vos_update_begin()
  for the phase2 pool, reserve value from the object evict-able bucket.

* DAOS-14316 vos: object preload for GC (#15059)

- Use the reserved vos_gc_item.it_args to store 2 bucket IDs for
  GC_OBJ, GC_DKEY and GC_AKEY, so that GC drain will be able to tell
  which buckets need to be pinned by looking up the bucket numbers
  stored in vos_obj_df.

- Once GC drain needs to pin a different bucket, it will have to commit
  current tx; unpin current bucket; pin required bucket; start new tx;

- Forge a dummy object as the private data for the btree opened by GC,
  so that the 'ti_destroy' hack could be removed.

- Store evict-able bucket ID persistently for newly created object, this
  was missed in prior PR.
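
The bucket switch during GC drain (commit the current tx, unpin the current bucket, pin the required bucket, start a new tx) can be sketched as below. The function and the string log are hypothetical stand-ins used to show the ordering only; they are not VOS APIs.

```python
def gc_drain(items):
    """Process (bucket_id, item) pairs, switching buckets in the stated order.

    Returns a log of the operations performed, for illustration.
    """
    log = []
    current = None
    for bucket, item in items:
        if bucket != current:
            if current is not None:
                # Close out work on the old bucket before touching a new one.
                log.append("commit tx")
                log.append(f"unpin {current}")
            log.append(f"pin {bucket}")
            log.append("begin tx")
            current = bucket
        log.append(f"free {item}")
    if current is not None:
        log.append("commit tx")
        log.append(f"unpin {current}")
    return log
```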

* DAOS-14315 vos: Pin objects for DTX commit & CPD RPC (#15118)

Introduced two new VOS APIs vos_pin_objects() & vos_unpin_objects()
for pin or unpin objects. Changed DTX commit/abort & CPD RPC handler
code to ensure objects pinned before starting local transaction.

- Bug fix in vos_pmemobj_create(), the actual scm_size should be passed
   to bio_mc_create().
- Use vos_obj_acquire() instead of vos_obj_hold() in vos_update_begin() to
  avoid the complication of object ilog adding in ts_set. We could simplify it
  in future cleanup PRs.
- Handle concurrent object bucket allotting & loading.

* DAOS-16160 control: Update pool create --size % opt for MD-on-SSD p2 (#14957)

Update calculation of usable pool META and DATA component sizes for
MD-on-SSD phase-2 mode; when meta-blob-size > vos-file-size.

- Use mem-ratio when making NVMe size adjustments to calculate usable
  pool capacity from raw stats.
- Use mem-ratio when auto-sizing to determine META component from
  percentage of usable rank-RAM-disk capacity.
- Apportion cluster count reductions to SSDs based on number of
  assigned targets to take account of target striping across a tier.
- Fix pool query ftest.
- Improve test coverage for meta and rdb size calculations.

* DAOS-16763 common: Tunable to control max NEMB (#15422)

A new tunable, DAOS_MD_ON_SSD_NEMB_PCT, is introduced to define the
percentage of memory cache that non-evictable memory buckets can
expand to. This tunable will be read during pool creation and
persisted, ensuring that each time the pool is reopened,
it retains the value set during its creation.

Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Signed-off-by: Sherin T George <sherin-t.george@hpe.com>
Co-authored-by: Tom Nabarro <tom.nabarro@intel.com>
Co-authored-by: sherintg <sherin-t.george@hpe.com>
Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>
Invalid hole extent might be left by process_hole_ult(),
so let's skip it.

Signed-off-by: Di Wang <ddiwang@google.com>
This only serves to add confusion at this point.

Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>
Signed-off-by: Jerome Soumagne <jerome.soumagne@intel.com>
Enable write access to the Security section of the Github project

Use the GHA cache to avoid Trivy scan failures: overuse of the CVE database results in database download failures.
Upgrade trivy-action to version 0.28.0 where the caching mechanism is enabled by default.
Enable the debug option in Trivy to be prepared for detailed scan failure analysis

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
#14932)

libfabric loads libze_loader.so, which calls zeInit(). We observed a deadlock due to nested calls when daos_init() is called inside zeInit(). We intercept dlsym() and zeInit() to avoid calling daos_init() inside zeInit(). dlsym(RTLD_NEXT, ) checks the return address to determine the caller's module. To maintain the expected behavior of dlsym(RTLD_NEXT, ) with our interception, new_dlsym() is implemented in assembly so that it uses a jmp instruction instead of a call. dlsym() has been moved from libdl.so to libc.so since glibc version 2.34.

Signed-off-by: Lei Huang <lei.huang@intel.com>
The patch contains the following improvements:

1. When VOS level logic returns -DER_TX_RESTART, the object level RPC
   handler should set 'RESEND' flag then restart the transaction with
   newer epoch. Because dtx_abort() logic cannot guarantee all former
   prepared DTX entries (on all related participants) can be aborted,
   especially if the former one failed for some network trouble, that
   may cause restarted transaction hit -DER_TX_ID_REUSED unexpectedly.

2. Compare the epoch for DTX entries with the same transaction ID for
   distinguishing potential reused TX ID more accurately.

3. Add DTX entry into DTX CoS cache if cannot commit it synchronously.
   Then subsequent batched commit logic can handle it.

4. If the server reports suspected TX ID reuse, return -EIO to the
   related application instead of asserting on the client.

5. Control DTX related warning message frequency to avoid log flood.

6. Collect more information when generate some error/warning message.

Signed-off-by: Fan Yong <fan.yong@intel.com>
Signed-off-by: Jerome Soumagne <jerome.soumagne@intel.com>
#15420)

The object placement algorithm was changed by DAOS-16445. As a result,
data are written to targets more uniformly while the amount of
leftover data after container destroy/garbage collection in each target
remains the same. i.e., Data are written to more targets while the
cleanup method in each target hasn't been improved, which results in
higher aggregate leftover data.

To handle larger amount of leftover data in SCM, increase the threshold
to 1.5MB.

Signed-off-by: Makito Kano <makito.kano@intel.com>
In cases where the client telemetry has been manually
enabled, daos_metrics should be able to read it as
long as the client's PID is known and the user has
read access to the shared memory segment.

Moves the daos_metrics utility into the common daos
package for use from both server and client sides.

Signed-off-by: Michael MacDonald <mjmac@google.com>
When STATIC_FUSE=1 is set, the arm build on ubuntu fails because
it can't find libfuse3.a.  Just add the expected path to the
list of search paths.

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
This PR enhances the DDB functionality for CR purposes with
the following updates:

1. Pool Behavior Control:

Administrators can now control certain vos pool behaviors,
such as skipping vos pool loading or setting a vos pool to immutable  mode.

2. Manual Pool Shard Removal:

A new command ddb rm_pool <vos_pool> has been introduced,
allowing administrators to manually remove pool shards.

3. SPDK Environment Initialization Bug Fix:

Fixed an issue where spdk_env_init() would fail during reinitialization.

These updates aim to improve system flexibility and stability,
providing administrators with more robust management capabilities.

Signed-off-by: Wang Shilong <shilong.wang@intel.com>
Create an active_inode struct and allocate it for all inodes which have more than
one open handle. This allows us to share state/caching data across open handles
easier and to better support concurrent readers. Future work here will improve
performance for concurrent readers when caching is used, and allow us to make
the in-memory inode struct smaller which will save memory.

Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>
In the past few passing runs, the test had ~100 sec test time
remaining at the end with 600 sec timeout. This means the
test usually takes ~500 sec. Set the timeout to
normal test duration * 1.5 = 750 sec

Signed-off-by: Makito Kano <makito.kano@intel.com>
Bump version to 2.7.101

faults-enabled: false

Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Update the Prometheus exporter to support passthrough histograms
from native DAOS telemetry format. Fixes a few bugs and inefficiencies
in the native histogram implementation.

Signed-off-by: Michael MacDonald <mjmac@google.com>
The evt recx trace is used for vos aggregation debugging, and it's currently
reset on akey iteration callback, but the akey iteration callback could be
skipped in some cases, for example, when evt aggregation hit an aborted recx,
it'll start over in evtree level without the recx trace reset, that could
lead to integer overflow on the 'int ap_trace_count'.

This patch moved the ap_trace_count reset to merge window open/close to ensure
the evt recx trace always being reset properly.

Signed-off-by: Niu Yawei <yawei.niu@intel.com>
…15418)

Update dmg storage query usage for MD-on-SSD P2

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
To minimize bucket eviction/load when iterating objects, vos_iterate_obj()
is introduced to iterate objects in bucket ID order instead of OI order.
The caller of vos_iterate_obj() needs to provide a filter callback to call
the vos_bkt_iter_skip() properly.
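
To illustrate why bucket ID order reduces eviction and loading, the sketch below compares the number of bucket loads for OI order and bucket order. The data layout and function names are hypothetical, and the skip test inside the loop only approximates the role of the filter callback; it is not the vos_iterate_obj() implementation.

```python
def count_bucket_loads(bucket_sequence):
    """Count how often the pinned bucket changes along a visit sequence."""
    loads, current = 0, None
    for bucket in bucket_sequence:
        if bucket != current:
            loads += 1
            current = bucket
    return loads


def iterate_in_bucket_order(objects, visit):
    """Visit objects one bucket at a time; return the bucket of each visit."""
    visited_buckets = []
    for bucket in sorted({obj["bucket"] for obj in objects}):
        for obj in objects:  # objects are assumed to be stored in OI order
            if obj["bucket"] != bucket:
                continue  # filter: skip objects outside the current bucket
            visit(obj)
            visited_buckets.append(bucket)
    return visited_buckets
```

When objects from two buckets alternate in OI order, OI-order iteration reloads a bucket at every object, while bucket-order iteration loads each bucket once.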

Applied the vos_iterate_obj() for EC & VOS aggregation.

Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Initialize checkpoint stats to zero.

Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Provide an inverse to the existing exclude_fabric_ifaces
directive. In some cases, a given environment will only have
a small number of valid interfaces, so it is simpler to
specify that rather than having to exclude all of the
invalid interfaces.
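
A sketch of how an include list and an exclude list might be applied to the detected interfaces. The function, its parameters, and the choice to treat the two lists as mutually exclusive are assumptions for illustration; only exclude_fabric_ifaces is named in the text above.

```python
def select_fabric_ifaces(detected, include=None, exclude=None):
    """Filter detected interface names by an include or an exclude list."""
    if include and exclude:
        # Assumption: specifying both is treated as a configuration error.
        raise ValueError("specify either include or exclude, not both")
    if include:
        return [name for name in detected if name in include]
    if exclude:
        return [name for name in detected if name not in exclude]
    return list(detected)
```

With many invalid interfaces and one valid one, `include=["ib0"]` is shorter to write than an exclude list naming everything else.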

Signed-off-by: Michael MacDonald <mjmac@google.com>
Fix a few minor remaining issues in previous PR intercepting dlsym and zeInit.

Use D_ASPRINTF when possible
Remove unneeded newline, debugging output, and parentheses.

Signed-off-by: Lei Huang <lei.huang@intel.com>
Add a section on handling unavailable engines.

Signed-off-by: Li Wei <wei.g.li@intel.com>
store mem-ratio in control-plane pool-service
remove unused tgt_dev param from extend create reintegrate and create ranks ds api
bump DAOS_MGMT_VERSION 3->4

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Update some tests to use unique dfuse mount directory by letting the
framework generate one.

Remove mount_dir from run_ior_multiple_variants since it is no longer
needed and this level of fine control should be handled per test
ideally.

Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>
* Add a suppression for Go runtime function racefuncenter.
* Add suppression for rt0_go CGo malloc

Signed-off-by: Kris Jacque <kris.jacque@intel.com>

Errors are component not formatted correctly, Ticket number prefix incorrect, PR title is malformatted. See https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments. Unable to load ticket data
https://daosio.atlassian.net/browse/mjmac/DAOS

@mjmac mjmac closed this Nov 13, 2024
@mjmac mjmac deleted the mjmac/DAOS-16787-google-2.6 branch November 13, 2024 17:03