
[Question] Does HugeCtr support H800 GPU? #414

Closed
sparkling9809 opened this issue Aug 11, 2023 · 6 comments
Labels: question (Further information is requested)

@sparkling9809

I ran the embedding_test in HugeCTR on an H800 GPU, but it failed. The exception is as follows:

root@jupyuterlab-nb-1691543551529-ddf9dcb96-jr455:/usr/local/hugectr/bin# ./embedding_test
Running main() from /hugectr/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 278 tests from 28 test suites.
[----------] Global test environment set-up.
[----------] 28 tests from distributed_sparse_embedding_hash_test
[ RUN ] distributed_sparse_embedding_hash_test.fp32_sgd_1gpu
MpiInitService: MPI was already initialized by another (non-HugeCTR) mechanism.
[HCTR][09:27:01.919][INFO][RK0][main]: Global seed is 1544237699
[HCTR][09:27:01.994][INFO][RK0][main]: Device to NUMA mapping:
GPU 0 -> node 1
[HCTR][09:27:02.470][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
[HCTR][09:27:02.470][DEBUG][RK0][main]: [device 0] allocating 0.0000 GB, available 76.5250
[HCTR][09:27:02.470][INFO][RK0][main]: Start all2all warmup
[HCTR][09:27:02.471][INFO][RK0][main]: End all2all warmup
[HCTR][09:27:02.472][INFO][RK0][main]: ./data_reader_test_data/temp_dataset_0.data
[HCTR][09:27:02.757][INFO][RK0][main]: train_file_list.txt done!
[HCTR][09:27:02.757][INFO][RK0][main]: ./data_reader_test_data exist
[HCTR][09:27:02.757][INFO][RK0][main]: ./data_reader_test_data/temp_dataset_0.data
[HCTR][09:27:02.828][INFO][RK0][main]: test_file_list.txt done!
[HCTR][09:27:02.828][DEBUG][RK0][main]: [device 0] allocating 0.0012 GB, available 76.2593
[HCTR][09:27:02.828][DEBUG][RK0][main]: [device 0] allocating 0.0030 GB, available 76.2554
[HCTR][09:27:03.179][INFO][RK0][main]: max_vocabulary_size_per_gpu_=100000
[HCTR][09:27:03.184][ERROR][RK0][main]: CUDA RT call "cudaGetLastError()" in line 341 of file /hugectr/HugeCTR/include/hashtable/cudf/concurrent_unordered_map.cuh failed with no kernel image is available for execution on the device (209).
root@jupyuterlab-nb-1691543551529-ddf9dcb96-jr455:/usr/local/hugectr/bin#

CUDA version: 12.2
HugeCTR docker image: Merlin-hugectr:23.02
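
For context, CUDA error 209 ("no kernel image is available for execution on the device") usually means the binary was not compiled for the GPU's architecture; the H800 is a Hopper GPU (compute capability 9.0, sm_90), which older images may not target. A minimal sketch for checking the device's compute capability from Python, assuming a driver new enough to support the compute_cap query field:

import subprocess

# Print the name and compute capability of each visible GPU;
# an H800 should report 9.0. (The compute_cap query field requires
# a reasonably recent NVIDIA driver.)
print(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
    text=True))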

sparkling9809 added the question label on Aug 11, 2023
@EmmaQiaoCh (Collaborator)

Hi, thanks for trying HugeCTR.
Could you use our latest image 23.06? Thanks.
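
For reference, pulling the suggested image might look like this (the NGC registry path is an assumption, not stated in this thread):

docker pull nvcr.io/nvidia/merlin/merlin-hugectr:23.06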

@sparkling9809 (Author)

Yes, the problem above was solved when I changed the image version to 23.06. But when I run training on multiple H800 GPUs, there is a new problem:

[HCTR][06:01:46.761][INFO][RK0][main]: --------------------Epoch 0, source file: /root/gq/2.txt--------------------
[HCTR][06:01:46.826][INFO][RK0][main]: Preparing embedding table for next pass
[HCTR][06:01:47.696][ERROR][RK0][main]: Runtime error: an illegal memory access was encountered
cudaStreamSynchronize(local_gpu->get_stream()) at sync_all_gpus (/hugectr/HugeCTR/src/embeddings/sync_all_gpus_functor.cu:28)
[HCTR][06:01:47.696][ERROR][RK0][main]: Runtime error: an illegal memory access was encountered
cudaStreamSynchronize(local_gpu->get_stream()) at sync_all_gpus (/hugectr/HugeCTR/src/embeddings/sync_all_gpus_functor.cu:28)
terminate called after throwing an instance of 'cudf::fatal_cuda_error'
what(): Fatal CUDA error encountered at: /opt/rapids/src/cudf/cpp/include/cudf/detail/utilities/pinned_allocator.hpp:170: 700 cudaErrorIllegalAddress an illegal memory access was encountered
[jupyuterlab-nb-1691568254333-866cf6f4f-855sl:02075] *** Process received signal ***
[jupyuterlab-nb-1691568254333-866cf6f4f-855sl:02075] Signal: Aborted (6)
[jupyuterlab-nb-1691568254333-866cf6f4f-855sl:02075] Signal code: (-6)
Traceback (most recent call last):
File "dcn_init_train.py", line 175, in

I found this method in the HugeCTR README doc:

NOTE: HugeCTR uses NCCL to share data between ranks, and NCCL may require shared memory for IPC and pinned (page-locked) system memory resources. It is recommended that you increase these resources by issuing the following options in the docker run command.

--shm-size=1g --ulimit memlock=-1
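
For reference, a minimal sketch of a full launch command with these options (the image tag and interactive flags here are illustrative, not from the README):

docker run --gpus all --shm-size=1g --ulimit memlock=-1 -it nvcr.io/nvidia/merlin/merlin-hugectr:23.06 /bin/bash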

I have tried this method, but the problem does not go away.

@sparkling9809 (Author)

Is there any progress on this question?

@shijieliu (Collaborator)

shijieliu commented Aug 22, 2023

Hi @sparkling9809, which training script are you using? From the log I can tell you are trying to use the Embedding Training Cache; is this expected?

If you want to try a sample that does not require the Embedding Training Cache, we advise you to try EmbeddingCollection, which is our latest embedding implementation; the old ones will be deprecated in the future. Here are the doc and sample for reference.
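
For orientation, here is a minimal sketch of what an EmbeddingCollection setup can look like, adapted from the embedding collection samples in the HugeCTR repo. The table count, vocabulary sizes, GPU count, and sharding below are illustrative assumptions, so check the linked doc and sample for the exact API:

import hugectr

NUM_TABLES = 91                      # illustrative: one table per sparse slot
SLOT_SIZES = [10000] * NUM_TABLES    # illustrative vocabulary sizes
NUM_GPUS = 3

# One EmbeddingTableConfig per sparse feature.
tables = [
    hugectr.EmbeddingTableConfig(name=str(i), max_vocabulary_size=SLOT_SIZES[i], ev_size=16)
    for i in range(NUM_TABLES)
]

ebc_config = hugectr.EmbeddingCollectionConfig()
# Fuse all lookups into a single output tensor consumed by the dense network.
ebc_config.embedding_lookup(
    table_config=tables,
    bottom_name=["data{}".format(i) for i in range(NUM_TABLES)],
    top_name="sparse_embedding1",
    combiner=["sum"] * NUM_TABLES,
)

# Model-parallel sharding: every GPU holds a shard of every table.
shard_matrix = [[str(i) for i in range(NUM_TABLES)] for _ in range(NUM_GPUS)]
shard_strategy = [("mp", [str(i) for i in range(NUM_TABLES)])]
ebc_config.shard(shard_matrix=shard_matrix, shard_strategy=shard_strategy)

# This replaces model.add(hugectr.SparseEmbedding(...)) in a model graph:
# model.add(ebc_config)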

@sparkling9809 (Author)

> Hi @sparkling9809, which training script are you using? From the log I can tell you are trying to use the Embedding Training Cache; is this expected?
>
> If you want to try a sample that does not require the Embedding Training Cache, we advise you to try EmbeddingCollection, which is our latest embedding implementation; the old ones will be deprecated in the future. Here are the doc and sample for reference.

Thanks for your reply!

The training script is as follows:



import argparse
import hugectr
from mpi4py import MPI
import time
from tools.utils import Log
logger = Log(__name__).getlog()

arg_parser = argparse.ArgumentParser(description="model train")
arg_parser.add_argument("--features_num", type=int, required=True)
arg_parser.add_argument("--check", type=str, required=True)
args = arg_parser.parse_args()
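# Each sparse feature yields a 16-dim embedding vector (matching embedding_vec_size
# below), so features_num * 16 is the flattened width consumed by the Reshape layer.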
total_num = args.features_num*16


# Vocabulary size (cardinality) per sparse slot; passed to the reader as slot_size_array.
all_solt = [10000,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,
            50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,
            50,50,110000000,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,
            50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,
            10000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
# for i in range(len(all_solt)):
#     if all_solt[i] < 1:
#         all_solt[i] = 10000
# all_solt[-2] = 0
All_Solt = all_solt[:args.features_num]
# All_Solt = all_solt
if args.check == "solo":
    # Source = ["./testdata0717/train_solo/0.txt", "./testdata0717/train_solo/1.txt", "./testdata0717/train_solo/2.txt"]
    # Keyset = ["./testdata0717/train_solo/0.keyset", "./testdata0717/train_solo/1.keyset", "./testdata0717/train_solo/2.keyset"]
    # Source = ["./testdata0717/train_all_mini/0.txt"]
    # Keyset = ["./testdata0717/train_all_mini/all.keyset"]
    # Source = ["/home/workspace/hy/tmp/"+str(i)+".txt" for i in range(3)]
    # Keyset = ["/home/workspace/hy/tmp/" + str(i) + ".keyset" for i in range(3)]
    Source=["/root/91feature_data/91_keyset/"+str(i)+".txt" for i in range(3)]
    Keyset = ["/root/91feature_data/91_keyset/" + str(i) + ".keyset" for i in range(3) ]
elif args.check == "all":
    # Source = ["./testdata0717/train_all/all.txt"]
    # Keyset = ["./testdata0717/train_all/all.keyset"]
    Source = ["/root/91_keyset/0.txt"]
    Keyset = ["/root/91_keyset/0.keyset"]
else:
    raise ValueError("Invalid check type, please enter solo or all")
# logger.info(f"Number of features: {args.features_num}")
    
solver = hugectr.CreateSolver(model_name = "wd2kw_seq",
                              max_eval_batches = 5000,
                              batchsize_eval = 36000,
                              # batchsize = 10240,
                              batchsize = 36000,
                              #batchsize = 1000,
                              lr = 0.001, 
                              vvgpu = [[0,1,2]],
                              i64_input_key = True,
                              use_mixed_precision = False,
                              repeat_dataset = False,
                              use_cuda_graph = True
                              # kafka_brockers = "10.68.225.168:9092,10.68.226.229:9092,10.68.227.181:9092"
                             )

reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Parquet,
                                  source = Source, keyset = Keyset,
                                  eval_source="/root/91feature_data/91_keyset/eval.txt",
                                  num_workers=30,
                                  slot_size_array=All_Solt,
                                  check_type = hugectr.Check_t.Sum)
# reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Parquet,
#                                   source = Source, keyset = Keyset,
#                                   eval_source="/home/workspace/hy/tmp/0.txt",
#                                   num_workers=30,
#                                   slot_size_array=All_Solt,
#                                   check_type = hugectr.Check_t.Sum)


optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.Adam)

# Embedding Training Cache (ETC): the Staged PS type keeps the full sparse model
# in host memory and swaps the embedding working set per training pass
# (one pass per source/keyset pair).
etc = hugectr.CreateETC(ps_types = [hugectr.TrainPSType_t.Staged],
                        sparse_models = ["/root/wd2kw_seq_0_sparse_model"],
                        local_paths = ["/root/"])

model = hugectr.Model(solver, reader, optimizer, etc)
model.add(hugectr.Input(label_dim = 1, label_name = "if_click",
                        dense_dim = 0, dense_name = "dense",
                        data_reader_sparse_param_array = 
                        [hugectr.DataReaderSparseParam("data1", 1, False, args.features_num)]))

model.add(
    hugectr.SparseEmbedding(
        embedding_type=hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,
        workspace_size_per_gpu_in_mb=26700,
        embedding_vec_size=16,
        combiner="sum",
        sparse_embedding_name="sparse_embedding1",
        bottom_name="data1",
        optimizer=optimizer,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Reshape,
        bottom_names=["sparse_embedding1"],
        top_names=["reshape1"],
        leading_dim=total_num,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.MultiCross,
        bottom_names=["reshape1"],
        top_names=["multicross1"],
        num_layers=6,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["reshape1"],
        top_names=["fc1"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc1"], top_names=["relu1"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Dropout,
        bottom_names=["relu1"],
        top_names=["dropout1"],
        dropout_rate=0.5,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["dropout1"],
        top_names=["fc2"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc2"], top_names=["relu2"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Dropout,
        bottom_names=["relu2"],
        top_names=["dropout2"],
        dropout_rate=0.5,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Concat,
        bottom_names=["dropout2", "multicross1"],
        top_names=["concat2"],
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["concat2"],
        top_names=["fc3"],
        num_output=1,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.BinaryCrossEntropyLoss,
        bottom_names=["fc3", "if_click"],
        top_names=["loss"],
    )
)

model.compile()
model.summary()
model.graph_to_json(graph_config_file = "wd2kw_seq.json")
#model.save_params_to_files("wdl")
model.fit(num_epochs = 1, display = 500, eval_interval = 100)

model.save_params_to_files("/root/wd2kw_seq")

The script runs fine on 8 H800 GPUs in a single machine, but something goes wrong when source includes more than 3 files.

reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Parquet,
                                  source = Source,   # the error occurs when source includes more than 3 files!
                                  keyset = Keyset,
                                  eval_source="/root/91feature_data/91_keyset/eval.txt",
                                  num_workers=30,
                                  slot_size_array=All_Solt,
                                  check_type = hugectr.Check_t.Sum)

The exception is as follows:

[HCTR][06:01:46.761][INFO][RK0][main]: --------------------Epoch 0, source file: /root/gq/2.txt--------------------
[HCTR][06:01:46.826][INFO][RK0][main]: Preparing embedding table for next pass
[HCTR][06:01:47.696][ERROR][RK0][main]: Runtime error: an illegal memory access was encountered
cudaStreamSynchronize(local_gpu->get_stream()) at sync_all_gpus (/hugectr/HugeCTR/src/embeddings/sync_all_gpus_functor.cu:28)
[HCTR][06:01:47.696][ERROR][RK0][main]: Runtime error: an illegal memory access was encountered
cudaStreamSynchronize(local_gpu->get_stream()) at sync_all_gpus (/hugectr/HugeCTR/src/embeddings/sync_all_gpus_functor.cu:28)
terminate called after throwing an instance of 'cudf::fatal_cuda_error'
what(): Fatal CUDA error encountered at: /opt/rapids/src/cudf/cpp/include/cudf/detail/utilities/pinned_allocator.hpp:170: 700 cudaErrorIllegalAddress an illegal memory access was encountered
[jupyuterlab-nb-1691568254333-866cf6f4f-855sl:02075] *** Process received signal ***
[jupyuterlab-nb-1691568254333-866cf6f4f-855sl:02075] Signal: Aborted (6)
[jupyuterlab-nb-1691568254333-866cf6f4f-855sl:02075] Signal code: (-6)
Traceback (most recent call last):
File "dcn_init_train.py", line 175,

@JacoCheung (Collaborator)

Closed as it's a duplicate of #417.
