[Question] An illegal memory access was encountered on H800 & Hugectr dcn test #417

dusir · 2023-09-07T12:17:12Z

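A customer is running model training on H800 GPUs with the NVIDIA-Merlin/HugeCTR software from GitHub, using the default open-source 23.06 container provided by NVIDIA, and hit a CUDA illegal memory access. We did not compile HugeCTR ourselves; we used the community-provided container directly.

Symptoms:
Using the original community HugeCTR 23.06 container image as-is, running the DCN test produces the following: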
root@jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:/home/workspace/xx# python dcn.py --check solo --features_num 91
[2023-08-30 06:34:47,329] dcn.py-> line:41 [INFO] Number of features: 91
MpiInitService: MPI was already initialized by another (non-HugeCTR) mechanism.
[HCTR][06:34:47.329][INFO][RK0][main]: Empty embedding, trained table will be stored in /root/wd2kw_seq_0_sparse_model
HugeCTR Version: 23.6
====================================================Model Init=====================================================
[HCTR][06:34:47.329][INFO][RK0][main]: Initialize model: wd2kw_seq
[HCTR][06:34:47.329][INFO][RK0][main]: Global seed is 2422757165
[HCTR][06:34:47.461][INFO][RK0][main]: Device to NUMA mapping:
GPU 0 -> node 1
GPU 1 -> node 1
GPU 2 -> node 1
NCCL version 2.17.1+cuda12.1
[HCTR][06:34:50.985][DEBUG][RK0][main]: [device 0] allocating 0.0000 GB, available 76.9557
[HCTR][06:34:50.985][DEBUG][RK0][main]: [device 1] allocating 0.0000 GB, available 76.9557
[HCTR][06:34:50.986][DEBUG][RK0][main]: [device 2] allocating 0.0000 GB, available 76.9557
[HCTR][06:34:50.986][INFO][RK0][main]: Start all2all warmup
[HCTR][06:34:52.220][INFO][RK0][main]: End all2all warmup
[HCTR][06:34:52.233][INFO][RK0][main]: Using All-reduce algorithm: NCCL
[HCTR][06:34:52.238][INFO][RK0][main]: Device 0: NVIDIA H800
[HCTR][06:34:52.239][INFO][RK0][main]: Device 1: NVIDIA H800
[HCTR][06:34:52.240][INFO][RK0][main]: Device 2: NVIDIA H800
[HCTR][06:34:52.290][INFO][RK0][main]: eval source /root/gq/eval.txt max_row_group_size 1095656
[HCTR][06:34:52.337][INFO][RK0][main]: train source /root/gq/0.txt max_row_group_size 1095656
[HCTR][06:34:52.385][INFO][RK0][main]: train source /root/gq/1.txt max_row_group_size 1095656
[HCTR][06:34:52.432][INFO][RK0][main]: train source /root/gq/2.txt max_row_group_size 1095656
[HCTR][06:34:52.432][INFO][RK0][main]: num of DataReader workers for train: 3
[HCTR][06:34:52.432][INFO][RK0][main]: num of DataReader workers for eval: 1
[HCTR][06:34:52.780][INFO][RK0][main]: max_vocabulary_size_per_gpu_=145817600
[HCTR][06:34:52.854][DEBUG][RK0][main]: [device 0] allocating 27.1206 GB, available 41.2467
[HCTR][06:34:52.912][DEBUG][RK0][tid #139961667938048]: [device 1] allocating 27.1206 GB, available 43.5807
[HCTR][06:34:52.916][DEBUG][RK0][tid #139961676330752]: [device 2] allocating 27.1206 GB, available 43.5807
[HCTR][06:34:52.922][INFO][RK0][main]: Graph analysis to resolve tensor dependency
[HCTR][06:34:52.922][INFO][RK0][main]: Add Slice layer for tensor: reshape1, creating 2 copies
[HCTR][06:34:52.923][WARNING][RK0][main]: using multi-cross v1
[HCTR][06:34:52.932][WARNING][RK0][main]: using multi-cross v1
[HCTR][06:34:52.939][WARNING][RK0][main]: using multi-cross v1
[HCTR][06:34:52.946][WARNING][RK0][main]: using multi-cross v1
[HCTR][06:34:52.952][WARNING][RK0][main]: using multi-cross v1
[HCTR][06:34:52.958][WARNING][RK0][main]: using multi-cross v1
===================================================Model Compile===================================================
DCN search_algo done
DCN search_algo done
[HCTR][06:35:13.739][INFO][RK0][main]: [HCTR][06:35:13.739][INFO][RK0][tid #139961676330752]: gpu0 start to init embedding
gpu2 start to init embedding
[HCTR][06:35:13.739][INFO][RK0][tid #139961667938048]: gpu1 start to init embedding
[HCTR][06:35:13.761][INFO][RK0][tid #139961676330752]: gpu2 init embedding done
[HCTR][06:35:13.761][INFO][RK0][tid #139961667938048]: gpu1 init embedding done
[HCTR][06:35:13.762][INFO][RK0][main]: gpu0 init embedding done
[HCTR][06:35:13.762][INFO][RK0][main]: Enable HMEM-Based Parameter Server
[HCTR][06:35:13.762][INFO][RK0][main]: /root/wd2kw_seq_0_sparse_model not exist, create and train from scratch
[HCTR][06:35:52.175][DEBUG][RK0][main]: [device 0] allocating 1.0864 GB, available 37.3600
[HCTR][06:35:52.178][DEBUG][RK0][main]: [device 1] allocating 1.0864 GB, available 39.6940
[HCTR][06:35:52.181][DEBUG][RK0][main]: [device 2] allocating 1.0864 GB, available 39.6940
[HCTR][06:35:52.233][INFO][RK0][main]: Starting AUC NCCL warm-up
[HCTR][06:35:52.269][INFO][RK0][main]: Warm-up done
===================================================Model Summary===================================================
[HCTR][06:35:52.269][INFO][RK0][main]: Model structure on each GPU
Label                                   Dense                         Sparse
if_click                                dense                         data1
(12000, 1)                              (12000, 0)
------------------------------------------------------------------------------------------------------------------
Layer Type                              Input Name                    Output Name                   Output Shape
------------------------------------------------------------------------------------------------------------------
DistributedSlotSparseEmbeddingHash      data1                         sparse_embedding1             (12000, 91, 16)

Reshape                                 sparse_embedding1             reshape1                      (12000, 1456)

Slice                                   reshape1                      reshape1_slice0               (12000, 1456)
                                                                      reshape1_slice1               (12000, 1456)

MultiCross                              reshape1_slice0               multicross1                   (12000, 1456)

InnerProduct                            reshape1_slice1               fc1                           (12000, 1024)

ReLU                                    fc1                           relu1                         (12000, 1024)

Dropout                                 relu1                         dropout1                      (12000, 1024)

InnerProduct                            dropout1                      fc2                           (12000, 1024)

ReLU                                    fc2                           relu2                         (12000, 1024)

Dropout                                 relu2                         dropout2                      (12000, 1024)

Concat                                  dropout2                      concat2                       (12000, 2480)
                                        multicross1

InnerProduct                            concat2                       fc3                           (12000, 1)

BinaryCrossEntropyLoss                  fc3                           loss
                                        if_click

[HCTR][06:35:52.410][INFO][RK0][main]: Save the model graph to wd2kw_seq.json successfully
=====================================================Model Fit=====================================================
[HCTR][06:35:52.410][INFO][RK0][main]: Use embedding training cache mode with number of training sources: 3, number of epochs: 1
[HCTR][06:35:52.410][INFO][RK0][main]: Training batchsize: 36000, evaluation batchsize: 36000
[HCTR][06:35:52.410][INFO][RK0][main]: Evaluation interval: 100, snapshot interval: 10000
[HCTR][06:35:52.410][INFO][RK0][main]: Dense network trainable: True
[HCTR][06:35:52.410][INFO][RK0][main]: Sparse embedding sparse_embedding1 trainable: True
[HCTR][06:35:52.410][INFO][RK0][main]: Use mixed precision: False, scaler: 1.000000, use cuda graph: True
[HCTR][06:35:52.410][INFO][RK0][main]: lr: 0.001000, warmup_steps: 1, end_lr: 0.000000
[HCTR][06:35:52.410][INFO][RK0][main]: decay_start: 0, decay_steps: 1, decay_power: 2.000000
[HCTR][06:35:52.458][INFO][RK0][main]: Evaluation source file: /root/gq/eval.txt
[HCTR][06:35:52.458][INFO][RK0][main]: --------------------Epoch 0, source file: /root/gq/0.txt--------------------
[HCTR][06:35:52.608][INFO][RK0][main]: Preparing embedding table for next pass
[HCTR][06:35:57.749][INFO][RK0][main]: Evaluation, AUC: 0.685634
[HCTR][06:35:57.749][INFO][RK0][main]: Eval Time for 5000 iters: 0.202734s
[HCTR][06:36:02.303][INFO][RK0][main]: Evaluation, AUC: 0.694559
[HCTR][06:36:02.303][INFO][RK0][main]: Eval Time for 5000 iters: 0.203171s
[HCTR][06:36:06.854][INFO][RK0][main]: Evaluation, AUC: 0.697146
[HCTR][06:36:06.854][INFO][RK0][main]: Eval Time for 5000 iters: 0.202961s
[HCTR][06:36:11.388][INFO][RK0][main]: Evaluation, AUC: 0.700038
[HCTR][06:36:11.388][INFO][RK0][main]: Eval Time for 5000 iters: 0.203444s
[HCTR][06:36:13.779][INFO][RK0][main]: train drop incomplete batch. batchsize:6044
[HCTR][06:36:13.809][INFO][RK0][main]: train drop incomplete batch. batchsize:6046
[HCTR][06:36:13.809][INFO][RK0][main]: train drop incomplete batch. batchsize:5971
[HCTR][06:36:13.809][INFO][RK0][main]: --------------------Epoch 0, source file: /root/gq/1.txt--------------------
[HCTR][06:36:13.955][INFO][RK0][main]: Preparing embedding table for next pass
[HCTR][06:36:25.812][INFO][RK0][main]: Iter: 500 Time(500 iters): 33.3733s Loss: 0.0613855 lr:0.001
[HCTR][06:36:26.027][INFO][RK0][main]: Evaluation, AUC: 0.701749
[HCTR][06:36:26.027][INFO][RK0][main]: Eval Time for 5000 iters: 0.204123s
[HCTR][06:36:30.580][INFO][RK0][main]: Evaluation, AUC: 0.704503
[HCTR][06:36:30.580][INFO][RK0][main]: Eval Time for 5000 iters: 0.204139s
[HCTR][06:36:35.124][INFO][RK0][main]: Evaluation, AUC: 0.705316
[HCTR][06:36:35.124][INFO][RK0][main]: Eval Time for 5000 iters: 0.203679s
[HCTR][06:36:39.677][INFO][RK0][main]: Evaluation, AUC: 0.707184
[HCTR][06:36:39.677][INFO][RK0][main]: Eval Time for 5000 iters: 0.203593s
[HCTR][06:36:44.205][INFO][RK0][main]: Evaluation, AUC: 0.708642
[HCTR][06:36:44.205][INFO][RK0][main]: Eval Time for 5000 iters: 0.20354s
[HCTR][06:36:44.748][INFO][RK0][main]: train drop incomplete batch. batchsize:6128
[HCTR][06:36:44.778][INFO][RK0][main]: train drop incomplete batch. batchsize:5969
[HCTR][06:36:44.778][INFO][RK0][main]: train drop incomplete batch. batchsize:6089
[HCTR][06:36:44.778][INFO][RK0][main]: --------------------Epoch 0, source file: /root/gq/2.txt--------------------
[HCTR][06:36:44.924][INFO][RK0][main]: Preparing embedding table for next pass
[HCTR][06:36:45.719][ERROR][RK0][main]: Runtime error: an illegal memory access was encountered
cudaStreamSynchronize(local_gpu->get_stream()) at sync_all_gpus (/hugectr/HugeCTR/src/embeddings/sync_all_gpus_functor.cu:28)
[HCTR][06:36:45.719][ERROR][RK0][main]: Runtime error: an illegal memory access was encountered
cudaStreamSynchronize(local_gpu->get_stream()) at sync_all_gpus (/hugectr/HugeCTR/src/embeddings/sync_all_gpus_functor.cu:28)
terminate called after throwing an instance of 'cudf::fatal_cuda_error'
what(): Fatal CUDA error encountered at: /opt/rapids/src/cudf/cpp/include/cudf/detail/utilities/pinned_allocator.hpp:170: 700 cudaErrorIllegalAddress an illegal memory access was encountered
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] *** Process received signal ***
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] Signal: Aborted (6)
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] Signal code: (-6)
Traceback (most recent call last):
File "dcn.py", line 175, in
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f4e0d4e6090]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f4e0d4e600b]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f4e0d4c5859]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7f4e0409b911]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7f4e040a738c]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa9369)[0x7f4e040a6369]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x2a1)[0x7f4e040a6d21]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [ 7] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x10bef)[0x7f4e03ff2bef]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12a)[0x7f4e03ff35aa]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [ 9] /lib/libcudf.so(_ZN4cudf6detail16throw_cuda_errorE9cudaErrorPKcj+0x58c)[0x7f4da7ff582c]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [10] /lib/libcudf.so(+0x188c77b)[0x7f4da889577b]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [11] /lib/libcudf.so(_ZN17hostdevice_vectorIPvED1Ev+0x67)[0x7f4da8894e67]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [12] /lib/libcudf.so(+0xe17c44)[0x7f4da7e20c44]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [13] /lib/libcudf.so(_ZN4cudf2io6detail7parquet6reader4impl19read_chunk_internalEb+0x37f)[0x7f4da8893e7f]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [14] /lib/libcudf.so(_ZN4cudf2io6detail7parquet6reader4impl4readEiibNS_9host_spanIKSt6vectorIiSaIiEELm18446744073709551615EEE+0x5c)[0x7f4da889488c]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [15] /lib/libcudf.so(_ZN4cudf2io6detail7parquet6reader4readERKNS0_22parquet_reader_optionsE+0x78)[0x7f4da888ef68]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [16] /lib/libcudf.so(_ZN4cudf2io12read_parquetERKNS0_22parquet_reader_optionsEPN3rmm2mr22device_memory_resourceE+0xc4)[0x7f4da87aafd4]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [17] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR17ParquetFileSource10read_groupEmPN3rmm2mr22device_memory_resourceE+0x322)[0x7f4e05311112]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [18] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR21RowGroupReadingThreadIxE18get_one_read_groupERKSt6vectorINS_21DataReaderSparseParamESaIS3_EERS2_ImSaImEERS2_IiSaIiEESD_+0xa0)[0x7f4e05349b10]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [19] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR13core23_reader23ParquetDataReaderWorkerIxE6do_h2dEv+0x18a)[0x7f4e0533c58a]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [20] /usr/local/hugectr/lib/libhuge_ctr_shared.so(+0x5b7efe)[0x7f4e052d7efe]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [21] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7f4e040d3de4]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [22] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f4e0d488609]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [23] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f4e0d5c2133]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] *** End of error message ***
client_loop: send disconnect: Broken pipe

gdb backtrace:
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007ffff7ded859 in __GI_abort () at abort.c:79
#2 0x00007fffee991911 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007fffee99d38c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007fffee99c369 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007fffee99cd21 in __gxx_personality_v0 () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007fffee8e8bef in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#7 0x00007fffee8e95aa in _Unwind_Resume () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#8 0x00007fff928f882c in cudf::detail::throw_cuda_error(cudaError, char const*, unsigned int) () from /lib/libcudf.so
#9 0x00007fff9319877b in std::__detail::__variant::__gen_vtable_impl<true, std::__detail::__variant::_Multi_array<std::__detail::__variant::__variant_cookie (*)(std::__detail::__variant::_Variant_storage<false, thrust::host_vector<void*, std::allocator<void*> >, thrust::host_vector<void*, cudf::detail::pinned_allocator<void*> > >::_M_reset_impl()::{lambda(auto:1&&)#1}&&, std::variant<thrust::host_vector<void*, std::allocator<void*> >, thrust::host_vector<void*, cudf::detail::pinned_allocator<void*> > >&)>, std::tuple<std::variant<thrust::host_vector<void*, std::allocator<void*> >, thrust::host_vector<void*, cudf::detail::pinned_allocator<void*> > > >, std::integer_sequence<unsigned long, 1ul> >::__visit_invoke(std::__detail::__variant::_Variant_storage<false, thrust::host_vector<void*, std::allocator<void*> >, thrust::host_vector<void*, cudf::detail::pinned_allocator<void*> > >::_M_reset_impl()::{lambda(auto:1&&)#1}, std::variant<thrust::host_vector<void*, std::allocator<void*> >, thrust::host_vector<void*, cudf::detail::pinned_allocator<void*> > >) () from /lib/libcudf.so
#10 0x00007fff93197e67 in hostdevice_vector<void*>::~hostdevice_vector() () from /lib/libcudf.so
#11 0x00007fff92723c44 in cudf::io::detail::parquet::reader::impl::decode_page_data(unsigned long, unsigned long) [clone .cold] ()
from /lib/libcudf.so
#12 0x00007fff93196e7f in cudf::io::detail::parquet::reader::impl::read_chunk_internal(bool) () from /lib/libcudf.so
#13 0x00007fff9319788c in cudf::io::detail::parquet::reader::impl::read(int, int, bool, cudf::host_span<std::vector<int, std::allocator<int> > const, 18446744073709551615ul>) () from /lib/libcudf.so
#14 0x00007fff93191f68 in cudf::io::detail::parquet::reader::read(cudf::io::parquet_reader_options const&) () from /lib/libcudf.so
#15 0x00007fff930adfd4 in cudf::io::read_parquet(cudf::io::parquet_reader_options const&, rmm::mr::device_memory_resource*) ()
from /lib/libcudf.so
#16 0x00007fffefc07112 in HugeCTR::ParquetFileSource::read_group(unsigned long, rmm::mr::device_memory_resource*) ()
from /usr/local/hugectr/lib/libhuge_ctr_shared.so
#17 0x00007fffefc3fb10 in HugeCTR::RowGroupReadingThread<long long>::get_one_read_group(std::vector<HugeCTR::DataReaderSparseParam, std::allocator<HugeCTR::DataReaderSparseParam> > const&, std::vector<unsigned long, std::allocator<unsigned long> >&, std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&) () from /usr/local/hugectr/lib/libhuge_ctr_shared.so
#18 0x00007fffefc3258a in HugeCTR::core23_reader::ParquetDataReaderWorker<long long>::do_h2d() ()
from /usr/local/hugectr/lib/libhuge_ctr_shared.so
#19 0x00007fffefbcdefe in HugeCTR::producer_thread_func(std::shared_ptrHugeCTR::IDataReaderWorker const&, std::shared_ptr<std::atomic > const&, int, bool volatile*) () from /usr/local/hugectr/lib/libhuge_ctr_shared.so
#20 0x00007fffee9c9de4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#21 0x00007ffff7db0609 in start_thread (arg=) at pthread_create.c:477
#22 0x00007ffff7eea133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Environment:
nccl:
NCCL version 2.17.1+cuda12.1

HugeCTR container image version: 23.06, the community-provided image, not built by the customer.

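How to reproduce?
With keysets, training on three txt datasets triggers the illegal memory access; training on a single txt dataset works fine.
My data is as follows.
The test data directory keyset contains:
0.keyset 1.keyset 2.keyset 3.keyset 4.keyset 5.keyset 6.keyset 7.keyset 8.keyset 9.keyset eval.txt
0.txt 1.txt 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt 8.txt 9.txt

The problem occurs whenever three txt files are used.

Could you please check whether this is caused by H800 not being supported? Thanks a lot.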

JacoCheung · 2023-09-08T01:31:16Z

Hi @dusir, is this issue exactly the same as #414? Can I assume the script in #414 is what caused the error?

From the log, I feel like there may be some misconfigurations on either the ETC or the dataset...
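
For context on the log: the error is reported at `cudaStreamSynchronize`, and CUDA reports illegal memory accesses asynchronously, so the kernel that actually faulted may have been launched earlier than the reported call site. A minimal sketch for rerunning the test with synchronous launches is below. `CUDA_LAUNCH_BLOCKING` is a standard CUDA environment variable; the helper names `blocking_env` and `run_blocking` are made up for illustration, and whether synchronous launches change the behaviour of this particular run (it uses CUDA graphs) has not been checked.

```python
import os
import subprocess
import sys


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # With this set, the failing call is more likely to be reported
    # where it happens instead of at a later synchronization point.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(cmd):
    """Run `cmd` (a list of arguments) with CUDA_LAUNCH_BLOCKING=1 and return its exit code."""
    return subprocess.run(cmd, env=blocking_env(), check=False).returncode


# Example call, using the command line from the report:
# run_blocking([sys.executable, "dcn.py", "--check", "solo", "--features_num", "91"])
```

Running will be slower in this mode, so it is only meant for narrowing down which call fails.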

JacoCheung · 2023-09-08T01:41:29Z

I suggest doing the following step by step to narrow down the problem:

1. If possible, can we try to reproduce this error on other platforms, say V100/A100?
2. Turn off ETC (the embedding training cache) to see if the problem is still there. (To turn off ETC, all source files must be consolidated into one.)
3. (If the problem is not solved in step 2) Use repeat mode instead of epoch mode. (Setting Solver::repeat_dataset=True enables repeat mode; see the doc.)
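
The consolidation in step 2 could look roughly like the sketch below. It assumes that each source such as /root/gq/0.txt is a HugeCTR-style file list whose first line is the number of data files and whose remaining lines are one data file path each; that format is an assumption about this dataset, and `merge_file_lists` is a hypothetical helper, not part of HugeCTR.

```python
from pathlib import Path


def merge_file_lists(list_paths, merged_path):
    """Merge several file lists into one and return the merged data file paths.

    Assumed format of every input list: the first line is the number of
    data files, each following non-empty line is one data file path.
    """
    data_files = []
    for list_path in list_paths:
        lines = [ln.strip() for ln in Path(list_path).read_text().splitlines() if ln.strip()]
        if not lines:
            raise ValueError(f"{list_path} is empty")
        declared, entries = int(lines[0]), lines[1:]
        if declared != len(entries):
            raise ValueError(
                f"{list_path} declares {declared} files but lists {len(entries)}"
            )
        data_files.extend(entries)
    # Write the merged list in the same assumed format: count first, then paths.
    Path(merged_path).write_text("\n".join([str(len(data_files)), *data_files]) + "\n")
    return data_files


# Example call with the sources from the log (the output path is made up):
# merge_file_lists(["/root/gq/0.txt", "/root/gq/1.txt", "/root/gq/2.txt"], "/root/gq/all.txt")
```

The merged list would then be passed as the only training source, without keyset files. Whether a merged list alone is enough for this dataset layout would need to be verified.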

dusir · 2023-09-08T10:04:59Z

> I suggest doing the following step by step to narrow down the problem:
>
> 1. If possible, can we try to reproduce this error on other platforms, say V100/A100?
> 2. Turn off ETC (the embedding training cache) to see if the problem is still there. (To turn off ETC, all source files must be consolidated into one.)
> 3. (If the problem is not solved in step 2) Use repeat mode instead of epoch mode. (Setting Solver::repeat_dataset=True enables repeat mode; see the doc.)

Hi,
1. On V100 everything works; there is no problem.
2. With ETC turned off, the issue does not occur.
3. After setting the option Solver::repeat_dataset=True, we hit an error:

JacoCheung · 2023-10-17T08:22:52Z

Close as it's fixed.

dusir added the question Further information is requested label Sep 7, 2023

zehuanw changed the title ~~[Question] H800上跑hugectr dcn测试，遇到踩内存的问题~~ [Question] An illegal memory access was encountered on H800 & Hugectr dcn test Sep 7, 2023

zehuanw added bug It's a bug / potential bug and need verification stage::doing labels Sep 11, 2023

viswa-nvidia added the P0 Must have label Sep 11, 2023

JacoCheung self-assigned this Sep 18, 2023

JacoCheung mentioned this issue Sep 18, 2023

[Question] Does HugeCtr support H800 GPU？ #414

Closed

JacoCheung closed this as completed Oct 17, 2023

JacoCheung mentioned this issue Nov 24, 2023

[BUG] Encountered ETC error of din model when training with multiple keyset. #429

Closed
