
[Question] Does HugeCtr support H800 GPU? #414

Closed
sparkling9809 opened this issue Aug 11, 2023 · 6 comments
Labels: question (Further information is requested)

@sparkling9809

I ran the embedding_test in HugeCTR on an H800 GPU, but it failed. The exception is as follows:

root@jupyuterlab-nb-1691543551529-ddf9dcb96-jr455:/usr/local/hugectr/bin# ./embedding_test
Running main() from /hugectr/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 278 tests from 28 test suites.
[----------] Global test environment set-up.
[----------] 28 tests from distributed_sparse_embedding_hash_test
[ RUN ] distributed_sparse_embedding_hash_test.fp32_sgd_1gpu
MpiInitService: MPI was already initialized by another (non-HugeCTR) mechanism.
[HCTR][09:27:01.919][INFO][RK0][main]: Global seed is 1544237699
[HCTR][09:27:01.994][INFO][RK0][main]: Device to NUMA mapping:
GPU 0 -> node 1
[HCTR][09:27:02.470][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
[HCTR][09:27:02.470][DEBUG][RK0][main]: [device 0] allocating 0.0000 GB, available 76.5250
[HCTR][09:27:02.470][INFO][RK0][main]: Start all2all warmup
[HCTR][09:27:02.471][INFO][RK0][main]: End all2all warmup
[HCTR][09:27:02.472][INFO][RK0][main]: ./data_reader_test_data/temp_dataset_0.data
[HCTR][09:27:02.757][INFO][RK0][main]: train_file_list.txt done!
[HCTR][09:27:02.757][INFO][RK0][main]: ./data_reader_test_data exist
[HCTR][09:27:02.757][INFO][RK0][main]: ./data_reader_test_data/temp_dataset_0.data
[HCTR][09:27:02.828][INFO][RK0][main]: test_file_list.txt done!
[HCTR][09:27:02.828][DEBUG][RK0][main]: [device 0] allocating 0.0012 GB, available 76.2593
[HCTR][09:27:02.828][DEBUG][RK0][main]: [device 0] allocating 0.0030 GB, available 76.2554
[HCTR][09:27:03.179][INFO][RK0][main]: max_vocabulary_size_per_gpu_=100000
[HCTR][09:27:03.184][ERROR][RK0][main]: CUDA RT call "cudaGetLastError()" in line 341 of file /hugectr/HugeCTR/include/hashtable/cudf/concurrent_unordered_map.cuh failed with no kernel image is available for execution on the device (209).
root@jupyuterlab-nb-1691543551529-ddf9dcb96-jr455:/usr/local/hugectr/bin#

CUDA version: 12.2
HugeCTR docker image: Merlin-hugectr:23.02
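
For context, CUDA error 209 ("no kernel image is available for execution on the device") usually means the binary was not compiled for the GPU's architecture; the H800 is a Hopper GPU (compute capability 9.0, sm_90), which older images may not target. A minimal sketch for checking the device's compute capability from Python, assuming a driver new enough to support the compute_cap query field:

import subprocess

# Print the name and compute capability of each visible GPU;
# an H800 should report 9.0. (The compute_cap query field requires
# a reasonably recent NVIDIA driver.)
print(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
    text=True))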

sparkling9809 added the question label on Aug 11, 2023
@EmmaQiaoCh (Collaborator)

Hi, thanks for trying HugeCTR.
Could you use our latest image 23.06? Thanks.
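
For reference, pulling the suggested image might look like this (the NGC registry path is an assumption, not stated in this thread):

docker pull nvcr.io/nvidia/merlin/merlin-hugectr:23.06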

@sparkling9809 (Author)

Yes, the problem above was solved when I changed the image version to 23.06. But when I run training on multiple H800 GPUs, there is a new problem:

[HCTR][06:01:46.761][INFO][RK0][main]: --------------------Epoch 0, source file: /root/gq/2.txt--------------------
[HCTR][06:01:46.826][INFO][RK0][main]: Preparing embedding table for next pass
[HCTR][06:01:47.696][ERROR][RK0][main]: Runtime error: an illegal memory access was encountered
cudaStreamSynchronize(local_gpu->get_stream()) at sync_all_gpus (/hugectr/HugeCTR/src/embeddings/sync_all_gpus_functor.cu:28)
[HCTR][06:01:47.696][ERROR][RK0][main]: Runtime error: an illegal memory access was encountered
cudaStreamSynchronize(local_gpu->get_stream()) at sync_all_gpus (/hugectr/HugeCTR/src/embeddings/sync_all_gpus_functor.cu:28)
terminate called after throwing an instance of 'cudf::fatal_cuda_error'
what(): Fatal CUDA error encountered at: /opt/rapids/src/cudf/cpp/include/cudf/detail/utilities/pinned_allocator.hpp:170: 700 cudaErrorIllegalAddress an illegal memory access was encountered
[jupyuterlab-nb-1691568254333-866cf6f4f-855sl:02075] *** Process received signal ***
[jupyuterlab-nb-1691568254333-866cf6f4f-855sl:02075] Signal: Aborted (6)
[jupyuterlab-nb-1691568254333-866cf6f4f-855sl:02075] Signal code: (-6)
Traceback (most recent call last):
File "dcn_init_train.py", line 175, in

I found this method in the HugeCTR README doc:

NOTE: HugeCTR uses NCCL to share data between ranks, and NCCL may require shared memory for IPC and pinned (page-locked) system memory resources. It is recommended that you increase these resources by issuing the following options in the docker run command.

--shm-size=1g --ulimit memlock=-1
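
For reference, a minimal sketch of a full launch command with these options (the image tag and interactive flags here are illustrative, not from the README):

docker run --gpus all --shm-size=1g --ulimit memlock=-1 -it nvcr.io/nvidia/merlin/merlin-hugectr:23.06 /bin/bash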

I have tried this method, but the problem does not go away.

@sparkling9809 (Author)

Is there any progress on this question?

@shijieliu (Collaborator)

shijieliu commented Aug 22, 2023

Hi @sparkling9809, which training script are you using? From the log I can tell you are trying to use the Embedding Training Cache; is this expected?

If you want to try a sample that does not require the Embedding Training Cache, we advise you to try EmbeddingCollection, which is our latest embedding implementation; the old ones will be deprecated in the future. Here are the doc and sample for reference.
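
For orientation, here is a minimal sketch of what an EmbeddingCollection setup can look like, adapted from the embedding collection samples in the HugeCTR repo. The table count, vocabulary sizes, GPU count, and sharding below are illustrative assumptions, so check the linked doc and sample for the exact API:

import hugectr

NUM_TABLES = 91                      # illustrative: one table per sparse slot
SLOT_SIZES = [10000] * NUM_TABLES    # illustrative vocabulary sizes
NUM_GPUS = 3

# One EmbeddingTableConfig per sparse feature.
tables = [
    hugectr.EmbeddingTableConfig(name=str(i), max_vocabulary_size=SLOT_SIZES[i], ev_size=16)
    for i in range(NUM_TABLES)
]

ebc_config = hugectr.EmbeddingCollectionConfig()
# Fuse all lookups into a single output tensor consumed by the dense network.
ebc_config.embedding_lookup(
    table_config=tables,
    bottom_name=["data{}".format(i) for i in range(NUM_TABLES)],
    top_name="sparse_embedding1",
    combiner=["sum"] * NUM_TABLES,
)

# Model-parallel sharding: every GPU holds a shard of every table.
shard_matrix = [[str(i) for i in range(NUM_TABLES)] for _ in range(NUM_GPUS)]
shard_strategy = [("mp", [str(i) for i in range(NUM_TABLES)])]
ebc_config.shard(shard_matrix=shard_matrix, shard_strategy=shard_strategy)

# This replaces model.add(hugectr.SparseEmbedding(...)) in a model graph:
# model.add(ebc_config)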

@sparkling9809 (Author)

> Hi @sparkling9809, which training script are you using? From the log I can tell you are trying to use the Embedding Training Cache; is this expected?
>
> If you want to try a sample that does not require the Embedding Training Cache, we advise you to try EmbeddingCollection, which is our latest embedding implementation; the old ones will be deprecated in the future. Here are the doc and sample for reference.

Thanks for your reply!

The training script is as follows:



import argparse
import hugectr
from mpi4py import MPI
import time
from tools.utils import Log
logger = Log(__name__).getlog()

arg_parser = argparse.ArgumentParser(description="model train")
arg_parser.add_argument("--features_num", type=int, required=True)
arg_parser.add_argument("--check", type=str, required=True)
args = arg_parser.parse_args()
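# Each sparse feature yields a 16-dim embedding vector (matching embedding_vec_size
# below), so features_num * 16 is the flattened width consumed by the Reshape layer.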
total_num = args.features_num*16


# Vocabulary size (cardinality) per sparse slot; passed to the reader as slot_size_array.
all_solt = [10000,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,
            50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,
            50,50,110000000,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,
            50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,
            10000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
# for i in range(len(all_solt)):
#     if all_solt[i] < 1:
#         all_solt[i] = 10000
# all_solt[-2] = 0
All_Solt = all_solt[:args.features_num]
# All_Solt = all_solt
if args.check == "solo":
    # Source = ["./testdata0717/train_solo/0.txt", "./testdata0717/train_solo/1.txt", "./testdata0717/train_solo/2.txt"]
    # Keyset = ["./testdata0717/train_solo/0.keyset", "./testdata0717/train_solo/1.keyset", "./testdata0717/train_solo/2.keyset"]
    # Source = ["./testdata0717/train_all_mini/0.txt"]
    # Keyset = ["./testdata0717/train_all_mini/all.keyset"]
    # Source = ["/home/workspace/hy/tmp/"+str(i)+".txt" for i in range(3)]
    # Keyset = ["/home/workspace/hy/tmp/" + str(i) + ".keyset" for i in range(3)]
    Source=["/root/91feature_data/91_keyset/"+str(i)+".txt" for i in range(3)]
    Keyset = ["/root/91feature_data/91_keyset/" + str(i) + ".keyset" for i in range(3) ]
elif args.check == "all":
    # Source = ["./testdata0717/train_all/all.txt"]
    # Keyset = ["./testdata0717/train_all/all.keyset"]
    Source = ["/root/91_keyset/0.txt"]
    Keyset = ["/root/91_keyset/0.keyset"]
else:
    raise ValueError("Invalid check type, please enter solo or all")
# logger.info(f"Number of features: {args.features_num}")
    
solver = hugectr.CreateSolver(model_name = "wd2kw_seq",
                              max_eval_batches = 5000,
                              batchsize_eval = 36000,
                              # batchsize = 10240,
                              batchsize = 36000,
                              #batchsize = 1000,
                              lr = 0.001, 
                              vvgpu = [[0,1,2]],
                              i64_input_key = True,
                              use_mixed_precision = False,
                              repeat_dataset = False,
                              use_cuda_graph = True
                              # kafka_brockers = "10.68.225.168:9092,10.68.226.229:9092,10.68.227.181:9092"
                             )

reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Parquet,
                                  source = Source, keyset = Keyset,
                                  eval_source="/root/91feature_data/91_keyset/eval.txt",
                                  num_workers=30,
                                  slot_size_array=All_Solt,
                                  check_type = hugectr.Check_t.Sum)
# reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Parquet,
#                                   source = Source, keyset = Keyset,
#                                   eval_source="/home/workspace/hy/tmp/0.txt",
#                                   num_workers=30,
#                                   slot_size_array=All_Solt,
#                                   check_type = hugectr.Check_t.Sum)


optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.Adam)

# Embedding Training Cache (ETC): the Staged PS type keeps the full sparse model
# in host memory and swaps the embedding working set per training pass
# (one pass per source/keyset pair).
etc = hugectr.CreateETC(ps_types = [hugectr.TrainPSType_t.Staged],
                        sparse_models = ["/root/wd2kw_seq_0_sparse_model"],
                        local_paths = ["/root/"])

model = hugectr.Model(solver, reader, optimizer, etc)
model.add(hugectr.Input(label_dim = 1, label_name = "if_click",
                        dense_dim = 0, dense_name = "dense",
                        data_reader_sparse_param_array = 
                        [hugectr.DataReaderSparseParam("data1", 1, False, args.features_num)]))

model.add(
    hugectr.SparseEmbedding(
        embedding_type=hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,
        workspace_size_per_gpu_in_mb=26700,
        embedding_vec_size=16,
        combiner="sum",
        sparse_embedding_name="sparse_embedding1",
        bottom_name="data1",
        optimizer=optimizer,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Reshape,
        bottom_names=["sparse_embedding1"],
        top_names=["reshape1"],
        leading_dim=total_num,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.MultiCross,
        bottom_names=["reshape1"],
        top_names=["multicross1"],
        num_layers=6,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["reshape1"],
        top_names=["fc1"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc1"], top_names=["relu1"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Dropout,
        bottom_names=["relu1"],
        top_names=["dropout1"],
        dropout_rate=0.5,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["dropout1"],
        top_names=["fc2"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc2"], top_names=["relu2"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Dropout,
        bottom_names=["relu2"],
        top_names=["dropout2"],
        dropout_rate=0.5,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Concat,
        bottom_names=["dropout2", "multicross1"],
        top_names=["concat2"],
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["concat2"],
        top_names=["fc3"],
        num_output=1,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.BinaryCrossEntropyLoss,
        bottom_names=["fc3", "if_click"],
        top_names=["loss"],
    )
)

model.compile()
model.summary()
model.graph_to_json(graph_config_file = "wd2kw_seq.json")
#model.save_params_to_files("wdl")
model.fit(num_epochs = 1, display = 500, eval_interval = 100)

model.save_params_to_files("/root/wd2kw_seq")

The script runs fine on 8 H800 GPUs in a single machine, but something goes wrong when source includes more than 3 files.

reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Parquet,
                                  source = Source,   # the error occurs when source includes more than 3 files!
                                  keyset = Keyset,
                                  eval_source="/root/91feature_data/91_keyset/eval.txt",
                                  num_workers=30,
                                  slot_size_array=All_Solt,
                                  check_type = hugectr.Check_t.Sum)

The exception is as follows:

[HCTR][06:01:46.761][INFO][RK0][main]: --------------------Epoch 0, source file: /root/gq/2.txt--------------------
[HCTR][06:01:46.826][INFO][RK0][main]: Preparing embedding table for next pass
[HCTR][06:01:47.696][ERROR][RK0][main]: Runtime error: an illegal memory access was encountered
cudaStreamSynchronize(local_gpu->get_stream()) at sync_all_gpus (/hugectr/HugeCTR/src/embeddings/sync_all_gpus_functor.cu:28)
[HCTR][06:01:47.696][ERROR][RK0][main]: Runtime error: an illegal memory access was encountered
cudaStreamSynchronize(local_gpu->get_stream()) at sync_all_gpus (/hugectr/HugeCTR/src/embeddings/sync_all_gpus_functor.cu:28)
terminate called after throwing an instance of 'cudf::fatal_cuda_error'
what(): Fatal CUDA error encountered at: /opt/rapids/src/cudf/cpp/include/cudf/detail/utilities/pinned_allocator.hpp:170: 700 cudaErrorIllegalAddress an illegal memory access was encountered
[jupyuterlab-nb-1691568254333-866cf6f4f-855sl:02075] *** Process received signal ***
[jupyuterlab-nb-1691568254333-866cf6f4f-855sl:02075] Signal: Aborted (6)
[jupyuterlab-nb-1691568254333-866cf6f4f-855sl:02075] Signal code: (-6)
Traceback (most recent call last):
File "dcn_init_train.py", line 175,

@JacoCheung (Collaborator)

Closed as it's a duplicate of #417.
