[Bug] WholeMemoryEmbedding gather operation does not work with integer-type of embeddings #69

Closed
chang-l opened this issue Sep 19, 2023 · 1 comment

Comments

@chang-l
Contributor

chang-l commented Sep 19, 2023

🐛 Bug

When a WholeMemoryEmbedding instance is created with memory_type=distributed, the gather operation crashes with dtype=int64 or int32 (it works fine with fp32 and fp64).

To Reproduce

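Minimal code to reproduce:

import torch
import pylibwholegraph.torch as wgth

# Assumes `global_comm` (the WholeMemory communicator) and `device` (this rank's
# CUDA device) were already set up by the usual multi-process launch code.
feat_size = 111059956
feat_dim = 1
dtype = torch.int64  # torch.int32 fails too; torch.float32 / torch.float64 work
node_feat_wm_embedding = wgth.create_embedding(
    global_comm,
    "distributed",
    "cpu",
    dtype,
    [feat_size, feat_dim],
)
sampled_nodes = 128000
input_nodes = torch.randint(0, feat_size, (sampled_nodes,), dtype=torch.int64, device=device)
x = node_feat_wm_embedding.gather(input_nodes)  # crashes here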

<error messages and stack traces>

WholeMemory failure at file=/opt/rapids/wholegraph/cpp/src/wholememory_ops/functions/scatter_func_impl_integer_data_int64_indices.cu line=55: File /opt/rapids/wholegraph/cpp/src/wholememory_ops/functions/scatter_func_impl_integer_data_int64_indices.cu, line 55, it != ScatterFuncIntegerInt64_dispatch2_map->end() check failed.
WholeMemory failure at file=/opt/rapids/wholegraph/cpp/src/wholememory_ops/functions/scatter_func_impl_integer_data_int64_indices.cu line=55: File /opt/rapids/wholegraph/cpp/src/wholememory_ops/functions/scatter_func_impl_integer_data_int64_indices.cu, line 55, it != ScatterFuncIntegerInt64_dispatch2_map->end() check failed.
WholeMemory failure at file=/opt/rapids/wholegraph/cpp/src/wholememory_ops/functions/scatter_func_impl_integer_data_int64_indices.cu line=55: File /opt/rapids/wholegraph/cpp/src/wholememory_ops/functions/scatter_func_impl_integer_data_int64_indices.cu, line 55, it != ScatterFuncIntegerInt64_dispatch2_map->end() check failed.
WholeMemory failure at file=/opt/rapids/wholegraph/cpp/src/wholememory_ops/functions/scatter_func_impl_integer_data_int64_indices.cu line=55: File /opt/rapids/wholegraph/cpp/src/wholememory_ops/functions/scatter_func_impl_integer_data_int64_indices.cu, line 55, it != ScatterFuncIntegerInt64_dispatch2_map->end() check failed.
[2023-09-19 20:43:35,753] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1203 closing signal SIGTERM
[2023-09-19 20:43:36,017] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 1202) of binary: /usr/bin/python

Environment

  • Version: 23.08
  • Source build with bash build.sh libwholegraph pylibwholegraph tests -v --allgpuarch

Additional notes:

This looks like a bug to me. For integer data, the scatter function implementation should dispatch on integer types rather than float types (HALF-FLOAT-DOUBLE), right?
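
To make the suspected failure mode concrete, here is a minimal Python sketch of a dtype-keyed dispatch table. It is not the WholeGraph code (which is C++/CUDA); every name in it (`DType`, `build_dispatch_map`, `dispatch`, `scatter_kernel`) is hypothetical and exists only for illustration. It assumes the integer scatter map was populated from the float type list, so a lookup with an integer dtype finds no entry, much like the `it != ...->end()` check in the log above.

```python
from enum import Enum, auto
from itertools import product


class DType(Enum):
    # Hypothetical stand-ins for the library's data types.
    HALF = auto()
    FLOAT = auto()
    DOUBLE = auto()
    INT8 = auto()
    INT16 = auto()
    INT32 = auto()
    INT64 = auto()


FLOAT_TYPES = (DType.HALF, DType.FLOAT, DType.DOUBLE)
INT_TYPES = (DType.INT8, DType.INT16, DType.INT32, DType.INT64)


def scatter_kernel(src_rows, indices, dst_rows):
    """Toy scatter: write src_rows[i] into dst_rows[indices[i]]."""
    for row, idx in zip(src_rows, indices):
        dst_rows[idx] = row
    return dst_rows


def build_dispatch_map(input_types, output_types, kernel):
    """Register one kernel for every (input dtype, output dtype) pair."""
    return {(i, o): kernel for i, o in product(input_types, output_types)}


def dispatch(dispatch_map, in_dtype, out_dtype):
    """Look up a kernel; fail loudly when the pair was never registered."""
    kernel = dispatch_map.get((in_dtype, out_dtype))
    if kernel is None:
        raise LookupError(f"no kernel registered for ({in_dtype.name}, {out_dtype.name})")
    return kernel


# Assumed bug: the "integer" map is built from the float type list.
buggy_integer_map = build_dispatch_map(FLOAT_TYPES, FLOAT_TYPES, scatter_kernel)
# Assumed fix: build it from the integer type list instead.
fixed_integer_map = build_dispatch_map(INT_TYPES, INT_TYPES, scatter_kernel)

if __name__ == "__main__":
    try:
        dispatch(buggy_integer_map, DType.INT64, DType.INT64)
    except LookupError as err:
        print("buggy map:", err)
    kernel = dispatch(fixed_integer_map, DType.INT64, DType.INT64)
    print("fixed map:", kernel([10, 20], [2, 0], [0, 0, 0]))
```

If this guess is right, the fix would only need to change which type list the integer map is registered from; the lookup and the kernels themselves could stay as they are.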

@dongxuy04
Contributor

@chang-l Thanks for pointing out this bug; PR #81 will fix it.

rapids-bot bot pushed a commit that referenced this issue Oct 17, 2023
…date example (#81)

Some Updates:

- Add separate init
- Expose gather/scatter for WholeMemoryTensor
- Some updates on examples and flags
- Fix integer scatter bug (closes issue #69).

Authors:
  - https://github.com/dongxuy04

Approvers:
  - Brad Rees (https://github.com/BradReesWork)

URL: #81
@chang-l chang-l closed this as completed Dec 6, 2023