@TheAssembler1 TheAssembler1 commented Jul 17, 2025

Prevents a segmentation fault that occurs when Mercury initialization fails.

[18:09:46.671414] [ERROR] [pdc_server.c:837] PDC_SERVER[0]: Error with HG_Init()
[18:09:46.671422] [ERROR] [pdc_server.c:2164] PDC_SERVER[0]: Error with PDC_Server_init
[ta1-pc:137951:0:137951] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
==== backtrace (tid: 137951) ====
 0  /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2ec) [0x741926849f2c]
 1  /lib/x86_64-linux-gnu/libucs.so.0(+0x3530d) [0x74192684b30d]
 2  /lib/x86_64-linux-gnu/libucs.so.0(+0x35619) [0x74192684b619]
 3  /lib/x86_64-linux-gnu/libc.so.6(+0x458d0) [0x7419270458d0]
 4  /home/ta1/src/workspace/source/pdc/build/bin/libpdc_commons.so(hash_table_num_entries+0x10) [0x741927453de1]
 5  /home/ta1/src/workspace/source/pdc/build/bin/libpdc_server_lib.so(PDC_Server_metadata_duplicate_check+0x58) [0x7419274b1a6a]
 6  /home/ta1/src/workspace/source/pdc/build/bin/libpdc_server_lib.so(PDC_Server_finalize+0x4d) [0x7419274a7cdd]
 7  /home/ta1/src/workspace/source/pdc/build/bin/libpdc_server_lib.so(server_run+0x2e9) [0x7419274abd10]
 8  ./pdc_server(main+0x24) [0x58ac12f6b8fe]
 9  /lib/x86_64-linux-gnu/libc.so.6(+0x2a578) [0x74192702a578]
10  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b) [0x74192702a63b]
11  ./pdc_server(_start+0x25) [0x58ac12f6a1e5]
=================================

Here is the first seg fault backtrace:

Thread 1 "pdc_server" received signal SIGSEGV, Segmentation fault.
0x00007ffff7edfde1 in hash_table_num_entries (hash_table=0x0)
    at /home/ta1/src/workspace/source/pdc/src/commons/collections/pdc_hash_table.c:541
541	    FUNC_LEAVE(hash_table->entries);
(gdb) bt
#0  0x00007ffff7edfde1 in hash_table_num_entries (hash_table=0x0)
    at /home/ta1/src/workspace/source/pdc/src/commons/collections/pdc_hash_table.c:541
#1  0x00007ffff7f3da6a in PDC_Server_metadata_duplicate_check () at /home/ta1/src/workspace/source/pdc/src/server/pdc_server_metadata.c:1353
#2  0x00007ffff7f33cdd in PDC_Server_finalize () at /home/ta1/src/workspace/source/pdc/src/server/pdc_server.c:1022
#3  0x00007ffff7f37d10 in server_run (argc=1, argv=0x7fffffffdc08) at /home/ta1/src/workspace/source/pdc/src/server/pdc_server.c:2313
#4  0x00005555555578fe in main (argc=1, argv=0x7fffffffdc08) at /home/ta1/src/workspace/source/pdc/src/server/pdc_server_main.c:8
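
The UCX report above gives the fault address as 0x30, while gdb shows `hash_table=0x0`. That combination is what one would expect when a field at offset 0x30 is read through a NULL struct pointer. The sketch below illustrates the arithmetic under an assumed struct layout. The field list is hypothetical and was not copied from `pdc_hash_table.c`; only the `entries` and `table_size` field names appear in the backtraces.

```c
#include <stddef.h>
#include <stdio.h>

typedef struct HashTableEntry HashTableEntry;

/* ASSUMED layout, for illustration only. The real PDC struct may differ. */
typedef struct HashTable {
    HashTableEntry **table;                        /* offset 0 on LP64 */
    unsigned int     table_size;                   /* offset 8, then 4 bytes padding */
    unsigned long  (*hash_func)(void *);           /* offset 16 */
    int            (*equal_func)(void *, void *);  /* offset 24 */
    void           (*key_free_func)(void *);       /* offset 32 */
    void           (*value_free_func)(void *);     /* offset 40 */
    unsigned int     entries;                      /* offset 48 == 0x30 */
    unsigned int     prime_index;
} HashTable;

/* Byte offset of the field read by hash_table_num_entries(). */
size_t
entries_offset(void)
{
    return offsetof(HashTable, entries);
}

/* Byte offset of the field read in the hash_table_iterate() loop condition. */
size_t
table_size_offset(void)
{
    return offsetof(HashTable, table_size);
}

/* Prints the address a NULL dereference of `entries` would touch. */
void
print_fault_address(void)
{
    printf("NULL + offsetof(entries) = 0x%zx\n", entries_offset());
}
```

With this assumed layout on a 64-bit Linux build, dereferencing `hash_table->entries` with `hash_table == NULL` touches address 0x30, which matches the reported fault address.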

After fixing the first seg fault, a second seg fault appeared later in the code path:

Thread 1 "pdc_server" received signal SIGSEGV, Segmentation fault.
0x00007ffff7edfeb1 in hash_table_iterate (hash_table=0x0, iterator=0x7fffffffda40)
    at /home/ta1/src/workspace/source/pdc/src/commons/collections/pdc_hash_table.c:563
563	    for (chain = 0; chain < hash_table->table_size; ++chain) {
(gdb) br
Breakpoint 1 at 0x7ffff7edfeb1: file /home/ta1/src/workspace/source/pdc/src/commons/collections/pdc_hash_table.c, line 563.
(gdb) bt
#0  0x00007ffff7edfeb1 in hash_table_iterate (hash_table=0x0, iterator=0x7fffffffda40)
    at /home/ta1/src/workspace/source/pdc/src/commons/collections/pdc_hash_table.c:563
#1  0x00007ffff7f3dc00 in PDC_Server_metadata_duplicate_check () at /home/ta1/src/workspace/source/pdc/src/server/pdc_server_metadata.c:1370
#2  0x00007ffff7f33cdd in PDC_Server_finalize () at /home/ta1/src/workspace/source/pdc/src/server/pdc_server.c:1022
#3  0x00007ffff7f37d10 in server_run (argc=1, argv=0x7fffffffdc08) at /home/ta1/src/workspace/source/pdc/src/server/pdc_server.c:2313
#4  0x00005555555578fe in main (argc=1, argv=0x7fffffffdc08) at /home/ta1/src/workspace/source/pdc/src/server/pdc_server_main.c:8

We now log warnings for these NULL pointers and exit without seg faulting:

~/src/workspace/source/pdc/build/bin ~/src/workspace/source/pdc
[INFO] PDC_SERVER[0]: PDC_DEBUG set to 1
[INFO] PDC_SERVER[0]: Using [./pdc_tmp/] as tmp dir, 1 OSTs, 1 OSTs per data file, 0% to BB
[INFO] PDC_SERVER[0]: Environment variable HG_TRANSPORT was NOT set
[INFO] PDC_SERVER[0]: Environment variable HG_HOST was NOT set
[INFO] PDC_SERVER[0]: Connection string: ofi+tcp://ta1-pc:7000
# [6336.845375] mercury->fatal: [error] /home/ta1/src/workspace/source/mercury/src/na/na_ofi.c:2832
 # na_ofi_verify_info(): No provider found for "tcp;ofi_rxm" provider on domain "ta1-pc"
[17:53:40.706707] [ERROR] [pdc_server.c:837] PDC_SERVER[0]: Error with HG_Init()
[17:53:40.706713] [ERROR] [pdc_server.c:2164] PDC_SERVER[0]: Error with PDC_Server_init
[WARNING] PDC_CLIENT[0]: hash_table was NULL
[INFO] PDC_SERVER[0]: Bloom filter says maybe 0 times out of 0
[INFO] PDC_SERVER[0]: Metadata duplicate check with 0 hash entries
[WARNING] PDC_CLIENT[0]: hash_table was NULL
[INFO] PDC_SERVER[0]:   ...No duplicates found
[17:53:40.706777] [ERROR] [pdc_server.c:990] PDC_SERVER[0]: pdc_remote_server_info_g was NULL
[17:53:40.706782] [ERROR] [pdc_server.c:1047] PDC_SERVER[0]: Error with PDC_Server_destroy_client_info
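
A minimal sketch of the kind of NULL guard described above, for the two functions named in the backtraces. The struct definitions are simplified stand-ins, and `fprintf` replaces the PDC logger and the `FUNC_ENTER`/`FUNC_LEAVE` macros; none of this is copied from the merged change.

```c
#include <stddef.h>
#include <stdio.h>

/* Simplified, hypothetical stand-ins for the PDC hash table types. */
typedef struct HashTableEntry {
    struct HashTableEntry *next;
} HashTableEntry;

typedef struct HashTable {
    HashTableEntry **table;
    unsigned int     table_size;
    unsigned int     entries;
} HashTable;

typedef struct HashTableIterator {
    HashTable      *hash_table;
    HashTableEntry *next_entry;
    unsigned int    next_chain;
} HashTableIterator;

/* Returns 0 instead of dereferencing a NULL table. */
unsigned int
hash_table_num_entries(HashTable *hash_table)
{
    if (hash_table == NULL) {
        fprintf(stderr, "[WARNING] hash_table was NULL\n");
        return 0;
    }
    return hash_table->entries;
}

/* Leaves the iterator empty when the table is NULL, so callers see no entries. */
void
hash_table_iterate(HashTable *hash_table, HashTableIterator *iterator)
{
    unsigned int chain;

    if (iterator == NULL) {
        fprintf(stderr, "[WARNING] iterator was NULL\n");
        return;
    }

    iterator->hash_table = hash_table;
    iterator->next_entry = NULL;
    iterator->next_chain = 0;

    if (hash_table == NULL) {
        fprintf(stderr, "[WARNING] hash_table was NULL\n");
        return;
    }

    /* Position the iterator on the first non-empty chain, if any. */
    for (chain = 0; chain < hash_table->table_size; ++chain) {
        if (hash_table->table[chain] != NULL) {
            iterator->next_entry = hash_table->table[chain];
            iterator->next_chain = chain;
            break;
        }
    }
}
```

Returning an empty result rather than aborting lets `PDC_Server_finalize` continue through the rest of its cleanup path, which is consistent with the "0 hash entries" and "No duplicates found" lines in the log above.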

@TheAssembler1 TheAssembler1 self-assigned this Jul 17, 2025
@TheAssembler1 TheAssembler1 requested a review from a team as a code owner July 17, 2025 23:10
@TheAssembler1 TheAssembler1 requested review from jeanbez and houjun July 18, 2025 00:57
@jeanbez jeanbez deployed to external July 19, 2025 21:48 — with GitHub Actions Active
@jeanbez jeanbez added the type: bug Something isn't working label Jul 21, 2025
@jeanbez jeanbez merged commit 797833d into hpc-io:develop Jul 21, 2025
9 checks passed
jeanbez added a commit that referenced this pull request Jul 21, 2025
* Add pdc_logger.h to installation (#245)

* sync with gitlab (#248)

* Fix restart issue (#228)

* Fix cache flush (#226)

* Fix a thread race issue that may cause memory error when larger than cache max size data is transferred

* Add a test that writes more data than server cache size

* Fix CI run command

* Fix restart issue

* Update nersc.yml (#238)

* Since PDCinit returns a uint64_t, 0 should indicate failure (#233)

Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Check the return value of `PDC_Client_init` in `PDC_init` (#230)

* Check that return value of PDC_Client_init in PDC_init

* Change return to 0

This will make is simpler when merging #233 (comment)

---------

Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Change `printf` to PDC logger (#232)

* Changed all printf to use pdc logger

Also removed large blocks of comments and chanegd the pdc logger
to print the file name, function, and line number.

* Change typo of LOG_INFO to LOG_ERROR

* Correct grammar from fail -> failed

* update grammer succesfully close -> successfully closed

* switch type of LOG_INFO to LOG_ERROR

* Add logging docs and fix some LOG_INFO->LOG_JUST_PRINT

* update clang formatting

---------

Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Malloc correct size for pdc_obj_metadata_pkg (#237)

Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* PDCregion_transfer_create validate client buf, local region, and remote regions (#236)

Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

---------

Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>
Co-authored-by: Noah Lewis <47840925+TheAssembler1@users.noreply.github.com>

* Fix return metadata dtype (#246)

Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Region info transfer struct type and helper functions (#247)

* Fix cache flush (#226)

* Fix a thread race issue that may cause memory error when larger than cache max size data is transferred

* Add a test that writes more data than server cache size

* Fix CI run command

* checkpoint

* Switch variables such as count_0, start_0, and size0... to arrays

This will reduce code duplication, reduce bugs, and make it easier
to switch to support n-dimnesional data.

* clang format

* checkpoint

* created better function names and documentation

* remove

* Committing clang-format changes

* clang format

* remove file

* change for use helper function

* fix bug with incorrect helper function call

---------

Co-authored-by: Houjun Tang <htang4@lbl.gov>
Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Fix issues with PDC tools (#249)

* Fix issues with PDC tools

* Correct LOG_ERROR to LOG_INFO

* Committing clang-format changes

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Fix printing in `PGOTO_ERROR` and `PGOTO_ERROR_VOID` (#250)

Print new line by default in `PGOTO_ERROR` and `PGOTO_ERROR_VOID`

Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Group Tests Into Folders (#252)

* Fix cache flush (#226)

* Fix a thread race issue that may cause memory error when larger than cache max size data is transferred

* Add a test that writes more data than server cache size

* Fix CI run command

* Grouped commons tests into folders

This commit also changes the src/tests/CmakeLists.txt to build tests
within their new folders

* add deprecated folder remove buf_map folder

* Update run_multiple_mpi_test.sh

* Update dependencies-macos.sh

* Update dependencies-macos.sh

---------

Co-authored-by: Houjun Tang <htang4@lbl.gov>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>
Co-authored-by: Jean Luca Bez <jeanlucabez@gmail.com>

* Return the same obj_id if the obj is just created or already opened (#254)

* Return the same obj_id if the obj is just created or already opened

* Committing clang-format changes

* Update doc

* Update dependencies-macos.sh

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* add option to choose interface (#255)

* add option to connect to a given network interface
* Committing clang-format changes
* fix conflict
* include header
* enable output on failure

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>

* Fix multithreading compilation (#259)

* fix multhreading compilation

* Committing clang-format changes

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Fix segmentation fault of calling `PDCobj_create_mpi` twice with duplicate object name (#262)

* Validate sucess of PDC_obj_create and PDC_find_id in PDCobj_create_mpi

* Committing clang-format changes

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Use `PDC_malloc`, `PDC_free`, `PDC_calloc`, and `PDC_realloc` (#260)

* checkpoint

* replace free with PDC_free and calloc with PDC_calloc

* Committing clang-format changes

* fix more mallocs to PDC_malloc

* more PDC_free fixes

* Committing clang-format changes

* Update ubuntu-cache.yml

* remove eno1

* fix realloc

* Committing clang-format changes

* Update ubuntu-no-cache.yaml

* Fix several bugs with error checking with object dim allocation

* Committing clang-format changes

* fix bug

* Committing clang-format changes

* Update ubuntu-no-cache.yaml

* Update ubuntu-cache.yml

* Set default value of ndim to 1 in PDCprop_create when using PDC_OBJ_CREATE

* Committing clang-format changes

* Malloc when defaulting to ndim size 1.
Only free hostname when we PDC_malloc the memory
because pointers returned by getenv are not malloced
and could point to static memory.

* Committing clang-format changes

* Update README.md

minor change to trigger the pipeline

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>
Co-authored-by: Jean Luca Bez <jeanlucabez@gmail.com>

* Fix Sphinx documentation errors and warnings (#265)

* Fix all sphinx warnings and errors. Removed repeat declarations of functions.

* Committing clang-format changes

* remove def of EXTENSION_MAPPING

* gitignore for docs and fix c structs

* Committing clang-format changes

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Replace `docs/README.md` -> steps to build docs (#268)

* Replace docs/README.md -> steps to build docs

* Update README.md

---------

Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Use `FUNC_ENTER` and `FUNC_LEAVE` (#270)

* use func enter and func leave in all functions

* Committing clang-format changes

* fix infinite recursion between memory managment, hash table, and per function timing

* Committing clang-format changes

* add profiling to CI

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* New test macros and code cleanup (#261)

* checkpoint

* Committing clang-format changes

* some tests

* Committing clang-format changes

* checkpoint

* open_obj uses new test macros

* Committing clang-format changes

* read_obj uses TASSERT

* read_obj uses TASSERT

* Committing clang-format changes

* cont_del and cont_getid use test macros

* convert more tests to use macros

* convert more tests to macros

* Committing clang-format changes

* Committing clang-format changes

* clang format

* use test helper in cont_info and cont_add_del

* more tests use macros

* Committing clang-format changes

* use tests macros in more tests

* use PGOTO* macros instead of goto

* clang format

* more log fixes

* logging cleanup and more usage of test macros

* Committing clang-format changes

* clang format and fix CMakeLists for tests

* use tests macros in transfer overlap 2D/3D

* use TASSERT in more tests

* Committing clang-format changes

* use test asserts

* all tests on the CI use TASSERT

* fix printing and newlines in tests

* print time, file name, function name, and line number in debug print

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>

* Tests logging typo fix (#273)

* Fixed logging typos

* Committing clang-format changes

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>

* Rename pdc_server.exe to pdc_server for consistency (#275)

Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Update vpicio_mts.c (#276)

Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Client Propogate `HG_Finalize` error on `PDCclose` (#263)

* all but 4 close errors are fixed

* Committing clang-format changes

* client side HG_Finalize now passes on serial tests

* Committing clang-format changes

* cleanup

* Committing clang-format changes

* Update pdc_region_transfer.c

* free bulk handles during region transfer close

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>

* Standardize ID Lookup Null Checks and Error Handling (#281)

* cleanup finding id's

* Committing clang-format changes

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>

* Obj open fix (#279)

* Fix seg fault for PDCobj_open on non-existent object

* Committing clang-format changes

* Remove log from NULL check

* Log message when object metadata isn't found.

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Fix multithread (#274)

* move hash table mutex to hashtable source filse

* Committing clang-format changes

* add multithread compile test

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

* Fix seg fault when mercury initialization fails (#283)

* check for NULL paramterse in hash table

* Committing clang-format changes

---------

Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>

---------

Co-authored-by: Noah Lewis <47840925+TheAssembler1@users.noreply.github.com>
Co-authored-by: Houjun Tang <htang4@lbl.gov>
Co-authored-by: github-actions <github-actions[bot]@users.noreply.github.com>
3 participants