[Question] An illegal memory access was encountered on H800 & Hugectr dcn test #417

Closed
dusir opened this issue Sep 7, 2023 · 4 comments
Labels: bug (It's a bug / potential bug and need verification), P0 (Must have), question (Further information is requested), stage::doing

@dusir

dusir commented Sep 7, 2023

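A customer is training a model on H800 GPUs with NVIDIA-Merlin/HugeCTR from GitHub, using the default open-source 23.06 container provided by NVIDIA, and hit an illegal CUDA memory access. We did not build HugeCTR ourselves; we used the community-provided container as is.

Symptoms:
Running the DCN test directly in the stock HugeCTR 23.06 container image produces the following: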
root@jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:/home/workspace/xx# python dcn.py --check solo --features_num 91
[2023-08-30 06:34:47,329] dcn.py-> line:41 [INFO]Number of features: 91
MpiInitService: MPI was already initialized by another (non-HugeCTR) mechanism.
[HCTR][06:34:47.329][INFO][RK0][main]: Empty embedding, trained table will be stored in /root/wd2kw_seq_0_sparse_model
HugeCTR Version: 23.6
====================================================Model Init=====================================================
[HCTR][06:34:47.329][INFO][RK0][main]: Initialize model: wd2kw_seq
[HCTR][06:34:47.329][INFO][RK0][main]: Global seed is 2422757165
[HCTR][06:34:47.461][INFO][RK0][main]: Device to NUMA mapping:
GPU 0 -> node 1
GPU 1 -> node 1
GPU 2 -> node 1
NCCL version 2.17.1+cuda12.1
[HCTR][06:34:50.985][DEBUG][RK0][main]: [device 0] allocating 0.0000 GB, available 76.9557
[HCTR][06:34:50.985][DEBUG][RK0][main]: [device 1] allocating 0.0000 GB, available 76.9557
[HCTR][06:34:50.986][DEBUG][RK0][main]: [device 2] allocating 0.0000 GB, available 76.9557
[HCTR][06:34:50.986][INFO][RK0][main]: Start all2all warmup
[HCTR][06:34:52.220][INFO][RK0][main]: End all2all warmup
[HCTR][06:34:52.233][INFO][RK0][main]: Using All-reduce algorithm: NCCL
[HCTR][06:34:52.238][INFO][RK0][main]: Device 0: NVIDIA H800
[HCTR][06:34:52.239][INFO][RK0][main]: Device 1: NVIDIA H800
[HCTR][06:34:52.240][INFO][RK0][main]: Device 2: NVIDIA H800
[HCTR][06:34:52.290][INFO][RK0][main]: eval source /root/gq/eval.txt max_row_group_size 1095656
[HCTR][06:34:52.337][INFO][RK0][main]: train source /root/gq/0.txt max_row_group_size 1095656
[HCTR][06:34:52.385][INFO][RK0][main]: train source /root/gq/1.txt max_row_group_size 1095656
[HCTR][06:34:52.432][INFO][RK0][main]: train source /root/gq/2.txt max_row_group_size 1095656
[HCTR][06:34:52.432][INFO][RK0][main]: num of DataReader workers for train: 3
[HCTR][06:34:52.432][INFO][RK0][main]: num of DataReader workers for eval: 1
[HCTR][06:34:52.780][INFO][RK0][main]: max_vocabulary_size_per_gpu_=145817600
[HCTR][06:34:52.854][DEBUG][RK0][main]: [device 0] allocating 27.1206 GB, available 41.2467
[HCTR][06:34:52.912][DEBUG][RK0][tid #139961667938048]: [device 1] allocating 27.1206 GB, available 43.5807
[HCTR][06:34:52.916][DEBUG][RK0][tid #139961676330752]: [device 2] allocating 27.1206 GB, available 43.5807
[HCTR][06:34:52.922][INFO][RK0][main]: Graph analysis to resolve tensor dependency
[HCTR][06:34:52.922][INFO][RK0][main]: Add Slice layer for tensor: reshape1, creating 2 copies
[HCTR][06:34:52.923][WARNING][RK0][main]: using multi-cross v1
[HCTR][06:34:52.932][WARNING][RK0][main]: using multi-cross v1
[HCTR][06:34:52.939][WARNING][RK0][main]: using multi-cross v1
[HCTR][06:34:52.946][WARNING][RK0][main]: using multi-cross v1
[HCTR][06:34:52.952][WARNING][RK0][main]: using multi-cross v1
[HCTR][06:34:52.958][WARNING][RK0][main]: using multi-cross v1
===================================================Model Compile===================================================
DCN search_algo done
DCN search_algo done
[HCTR][06:35:13.739][INFO][RK0][main]: [HCTR][06:35:13.739][INFO][RK0][tid #139961676330752]: gpu0 start to init embedding
gpu2 start to init embedding
[HCTR][06:35:13.739][INFO][RK0][tid #139961667938048]: gpu1 start to init embedding
[HCTR][06:35:13.761][INFO][RK0][tid #139961676330752]: gpu2 init embedding done
[HCTR][06:35:13.761][INFO][RK0][tid #139961667938048]: gpu1 init embedding done
[HCTR][06:35:13.762][INFO][RK0][main]: gpu0 init embedding done
[HCTR][06:35:13.762][INFO][RK0][main]: Enable HMEM-Based Parameter Server
[HCTR][06:35:13.762][INFO][RK0][main]: /root/wd2kw_seq_0_sparse_model not exist, create and train from scratch
[HCTR][06:35:52.175][DEBUG][RK0][main]: [device 0] allocating 1.0864 GB, available 37.3600
[HCTR][06:35:52.178][DEBUG][RK0][main]: [device 1] allocating 1.0864 GB, available 39.6940
[HCTR][06:35:52.181][DEBUG][RK0][main]: [device 2] allocating 1.0864 GB, available 39.6940
[HCTR][06:35:52.233][INFO][RK0][main]: Starting AUC NCCL warm-up
[HCTR][06:35:52.269][INFO][RK0][main]: Warm-up done
===================================================Model Summary===================================================
[HCTR][06:35:52.269][INFO][RK0][main]: Model structure on each GPU
Label Dense Sparse
if_click dense data1
(12000, 1) (12000, 0)
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
Layer Type Input Name Output Name Output Shape
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
DistributedSlotSparseEmbeddingHash data1 sparse_embedding1 (12000, 91, 16)

Reshape sparse_embedding1 reshape1 (12000, 1456)

Slice reshape1 reshape1_slice0 (12000, 1456)
reshape1_slice1 (12000, 1456)

MultiCross reshape1_slice0 multicross1 (12000, 1456)

InnerProduct reshape1_slice1 fc1 (12000, 1024)

ReLU fc1 relu1 (12000, 1024)

Dropout relu1 dropout1 (12000, 1024)

InnerProduct dropout1 fc2 (12000, 1024)

ReLU fc2 relu2 (12000, 1024)

Dropout relu2 dropout2 (12000, 1024)

Concat dropout2 concat2 (12000, 2480)
multicross1

InnerProduct concat2 fc3 (12000, 1)

BinaryCrossEntropyLoss fc3 loss
if_click

[HCTR][06:35:52.410][INFO][RK0][main]: Save the model graph to wd2kw_seq.json successfully
=====================================================Model Fit=====================================================
[HCTR][06:35:52.410][INFO][RK0][main]: Use embedding training cache mode with number of training sources: 3, number of epochs: 1
[HCTR][06:35:52.410][INFO][RK0][main]: Training batchsize: 36000, evaluation batchsize: 36000
[HCTR][06:35:52.410][INFO][RK0][main]: Evaluation interval: 100, snapshot interval: 10000
[HCTR][06:35:52.410][INFO][RK0][main]: Dense network trainable: True
[HCTR][06:35:52.410][INFO][RK0][main]: Sparse embedding sparse_embedding1 trainable: True
[HCTR][06:35:52.410][INFO][RK0][main]: Use mixed precision: False, scaler: 1.000000, use cuda graph: True
[HCTR][06:35:52.410][INFO][RK0][main]: lr: 0.001000, warmup_steps: 1, end_lr: 0.000000
[HCTR][06:35:52.410][INFO][RK0][main]: decay_start: 0, decay_steps: 1, decay_power: 2.000000
[HCTR][06:35:52.458][INFO][RK0][main]: Evaluation source file: /root/gq/eval.txt
[HCTR][06:35:52.458][INFO][RK0][main]: --------------------Epoch 0, source file: /root/gq/0.txt--------------------
[HCTR][06:35:52.608][INFO][RK0][main]: Preparing embedding table for next pass
[HCTR][06:35:57.749][INFO][RK0][main]: Evaluation, AUC: 0.685634
[HCTR][06:35:57.749][INFO][RK0][main]: Eval Time for 5000 iters: 0.202734s
[HCTR][06:36:02.303][INFO][RK0][main]: Evaluation, AUC: 0.694559
[HCTR][06:36:02.303][INFO][RK0][main]: Eval Time for 5000 iters: 0.203171s
[HCTR][06:36:06.854][INFO][RK0][main]: Evaluation, AUC: 0.697146
[HCTR][06:36:06.854][INFO][RK0][main]: Eval Time for 5000 iters: 0.202961s
[HCTR][06:36:11.388][INFO][RK0][main]: Evaluation, AUC: 0.700038
[HCTR][06:36:11.388][INFO][RK0][main]: Eval Time for 5000 iters: 0.203444s
[HCTR][06:36:13.779][INFO][RK0][main]: train drop incomplete batch. batchsize:6044
[HCTR][06:36:13.809][INFO][RK0][main]: train drop incomplete batch. batchsize:6046
[HCTR][06:36:13.809][INFO][RK0][main]: train drop incomplete batch. batchsize:5971
[HCTR][06:36:13.809][INFO][RK0][main]: --------------------Epoch 0, source file: /root/gq/1.txt--------------------
[HCTR][06:36:13.955][INFO][RK0][main]: Preparing embedding table for next pass
[HCTR][06:36:25.812][INFO][RK0][main]: Iter: 500 Time(500 iters): 33.3733s Loss: 0.0613855 lr:0.001
[HCTR][06:36:26.027][INFO][RK0][main]: Evaluation, AUC: 0.701749
[HCTR][06:36:26.027][INFO][RK0][main]: Eval Time for 5000 iters: 0.204123s
[HCTR][06:36:30.580][INFO][RK0][main]: Evaluation, AUC: 0.704503
[HCTR][06:36:30.580][INFO][RK0][main]: Eval Time for 5000 iters: 0.204139s
[HCTR][06:36:35.124][INFO][RK0][main]: Evaluation, AUC: 0.705316
[HCTR][06:36:35.124][INFO][RK0][main]: Eval Time for 5000 iters: 0.203679s
[HCTR][06:36:39.677][INFO][RK0][main]: Evaluation, AUC: 0.707184
[HCTR][06:36:39.677][INFO][RK0][main]: Eval Time for 5000 iters: 0.203593s
[HCTR][06:36:44.205][INFO][RK0][main]: Evaluation, AUC: 0.708642
[HCTR][06:36:44.205][INFO][RK0][main]: Eval Time for 5000 iters: 0.20354s
[HCTR][06:36:44.748][INFO][RK0][main]: train drop incomplete batch. batchsize:6128
[HCTR][06:36:44.778][INFO][RK0][main]: train drop incomplete batch. batchsize:5969
[HCTR][06:36:44.778][INFO][RK0][main]: train drop incomplete batch. batchsize:6089
[HCTR][06:36:44.778][INFO][RK0][main]: --------------------Epoch 0, source file: /root/gq/2.txt--------------------
[HCTR][06:36:44.924][INFO][RK0][main]: Preparing embedding table for next pass
[HCTR][06:36:45.719][ERROR][RK0][main]: Runtime error: an illegal memory access was encountered
cudaStreamSynchronize(local_gpu->get_stream()) at sync_all_gpus (/hugectr/HugeCTR/src/embeddings/sync_all_gpus_functor.cu:28)
[HCTR][06:36:45.719][ERROR][RK0][main]: Runtime error: an illegal memory access was encountered
cudaStreamSynchronize(local_gpu->get_stream()) at sync_all_gpus (/hugectr/HugeCTR/src/embeddings/sync_all_gpus_functor.cu:28)
terminate called after throwing an instance of 'cudf::fatal_cuda_error'
what(): Fatal CUDA error encountered at: /opt/rapids/src/cudf/cpp/include/cudf/detail/utilities/pinned_allocator.hpp:170: 700 cudaErrorIllegalAddress an illegal memory access was encountered
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] *** Process received signal ***
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] Signal: Aborted (6)
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] Signal code: (-6)
Traceback (most recent call last):
File "dcn.py", line 175, in
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f4e0d4e6090]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f4e0d4e600b]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f4e0d4c5859]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7f4e0409b911]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7f4e040a738c]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa9369)[0x7f4e040a6369]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x2a1)[0x7f4e040a6d21]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [ 7] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x10bef)[0x7f4e03ff2bef]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12a)[0x7f4e03ff35aa]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [ 9] /lib/libcudf.so(_ZN4cudf6detail16throw_cuda_errorE9cudaErrorPKcj+0x58c)[0x7f4da7ff582c]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [10] /lib/libcudf.so(+0x188c77b)[0x7f4da889577b]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [11] /lib/libcudf.so(_ZN17hostdevice_vectorIPvED1Ev+0x67)[0x7f4da8894e67]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [12] /lib/libcudf.so(+0xe17c44)[0x7f4da7e20c44]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [13] /lib/libcudf.so(_ZN4cudf2io6detail7parquet6reader4impl19read_chunk_internalEb+0x37f)[0x7f4da8893e7f]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [14] /lib/libcudf.so(_ZN4cudf2io6detail7parquet6reader4impl4readEiibNS_9host_spanIKSt6vectorIiSaIiEELm18446744073709551615EEE+0x5c)[0x7f4da889488c]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [15] /lib/libcudf.so(_ZN4cudf2io6detail7parquet6reader4readERKNS0_22parquet_reader_optionsE+0x78)[0x7f4da888ef68]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [16] /lib/libcudf.so(_ZN4cudf2io12read_parquetERKNS0_22parquet_reader_optionsEPN3rmm2mr22device_memory_resourceE+0xc4)[0x7f4da87aafd4]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [17] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR17ParquetFileSource10read_groupEmPN3rmm2mr22device_memory_resourceE+0x322)[0x7f4e05311112]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [18] /usr/local/hugectr/lib/libhuge_ctr_shared.so(ZN7HugeCTR21RowGroupReadingThreadIxE18get_one_read_groupERKSt6vectorINS_21DataReaderSparseParamESaIS3_EERS2_ImSaImEERS2_IiSaIiEESD+0xa0)[0x7f4e05349b10]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [19] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR13core23_reader23ParquetDataReaderWorkerIxE6do_h2dEv+0x18a)[0x7f4e0533c58a]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [20] /usr/local/hugectr/lib/libhuge_ctr_shared.so(+0x5b7efe)[0x7f4e052d7efe]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [21] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7f4e040d3de4]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [22] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f4e0d488609]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] [23] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f4e0d5c2133]
[jupyuterlab-nb-1692168896177-745f9b44d4-grkxn:17906] *** End of error message ***
client_loop: send disconnect: Broken pipe

The gdb backtrace is as follows:
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007ffff7ded859 in __GI_abort () at abort.c:79
#2 0x00007fffee991911 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007fffee99d38c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007fffee99c369 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007fffee99cd21 in __gxx_personality_v0 () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007fffee8e8bef in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#7 0x00007fffee8e95aa in _Unwind_Resume () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#8 0x00007fff928f882c in cudf::detail::throw_cuda_error(cudaError, char const*, unsigned int) () from /lib/libcudf.so
#9 0x00007fff9319877b in std::__detail::__variant::__gen_vtable_impl<true, std::__detail::__variant::_Multi_array<std::__detail::__variant::__variant_cookie ()(std::__detail::__variant::_Variant_storage<false, thrust::host_vector<void, std::allocator<void*> >, thrust::host_vector<void*, cudf::detail::pinned_allocator<void*> > >::_M_reset_impl()::{lambda(auto:1&&)#1}&&, std::variant<thrust::host_vector<void*, std::allocator<void*> >, thrust::host_vector<void*, cudf::detail::pinned_allocator<void*> > >&)>, std::tuple<std::variant<thrust::host_vector<void*, std::allocator<void*> >, thrust::host_vector<void*, cudf::detail::pinned_allocator<void*> > > >, std::integer_sequence<unsigned long, 1ul> >::__visit_invoke(std::__detail::__variant::_Variant_storage<false, thrust::host_vector<void*, std::allocator<void*> >, thrust::host_vector<void*, cudf::detail::pinned_allocator<void*> > >::M_reset_impl()::{lambda(auto:1&&)#1}, std::variant<thrust::host_vector<void*, std::allocator<void*> >, thrust::host_vector<void*, cudf::detail::pinned_allocator<void*> > >) () from /lib/libcudf.so
#10 0x00007fff93197e67 in hostdevice_vector<void*>::~hostdevice_vector() () from /lib/libcudf.so
#11 0x00007fff92723c44 in cudf::io::detail::parquet::reader::impl::decode_page_data(unsigned long, unsigned long) [clone .cold] ()
from /lib/libcudf.so
#12 0x00007fff93196e7f in cudf::io::detail::parquet::reader::impl::read_chunk_internal(bool) () from /lib/libcudf.so
#13 0x00007fff9319788c in cudf::io::detail::parquet::reader::impl::read(int, int, bool, cudf::host_span<std::vector<int, std::allocator > const, 18446744073709551615ul>) () from /lib/libcudf.so
#14 0x00007fff93191f68 in cudf::io::detail::parquet::reader::read(cudf::io::parquet_reader_options const&) () from /lib/libcudf.so
#15 0x00007fff930adfd4 in cudf::io::read_parquet(cudf::io::parquet_reader_options const&, rmm::mr::device_memory_resource*) ()
from /lib/libcudf.so
#16 0x00007fffefc07112 in HugeCTR::ParquetFileSource::read_group(unsigned long, rmm::mr::device_memory_resource*) ()
from /usr/local/hugectr/lib/libhuge_ctr_shared.so
#17 0x00007fffefc3fb10 in HugeCTR::RowGroupReadingThread::get_one_read_group(std::vector<HugeCTR::DataReaderSparseParam, std::allocatorHugeCTR::DataReaderSparseParam > const&, std::vector<unsigned long, std::allocator >&, std::vector<int, std::allocator >&, std::vector<int, std::allocator >&) () from /usr/local/hugectr/lib/libhuge_ctr_shared.so
#18 0x00007fffefc3258a in HugeCTR::core23_reader::ParquetDataReaderWorker::do_h2d() ()
from /usr/local/hugectr/lib/libhuge_ctr_shared.so
#19 0x00007fffefbcdefe in HugeCTR::producer_thread_func
(std::shared_ptrHugeCTR::IDataReaderWorker const&, std::shared_ptr<std::atomic > const&, int, bool volatile*) () from /usr/local/hugectr/lib/libhuge_ctr_shared.so
#20 0x00007fffee9c9de4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#21 0x00007ffff7db0609 in start_thread (arg=) at pthread_create.c:477
#22 0x00007ffff7eea133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Environment:
nccl:
NCCL version 2.17.1+cuda12.1

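HugeCTR container image version: 23.06, provided by the community, not built by the customer.

How to reproduce?
With keysets, running three txt datasets triggers the illegal memory access; running only one txt dataset works fine.
My data is as follows:
The test data directory keyset contains: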
0.keyset 1.keyset 2.keyset 3.keyset 4.keyset 5.keyset 6.keyset 7.keyset 8.keyset 9.keyset eval.txt
0.txt 1.txt 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt 8.txt 9.txt

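The problem occurs whenever three txt files are used.

Could you please check whether this is caused by H800 not being supported? Thanks a lot.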
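
Because CUDA reports kernel faults asynchronously, the `cudaStreamSynchronize` call in the log above is where the error surfaced, not necessarily where it happened. A minimal sketch for a more precise rerun follows. It assumes `compute-sanitizer` from the CUDA toolkit is available in the container, which this thread does not confirm, and `build_debug_run` is a hypothetical helper, not part of HugeCTR.

```python
import os
import shlex


def build_debug_run(script_args, use_sanitizer=False, base_env=None):
    """Return (command, environment) for a debug rerun of a training script.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Disable asynchronous kernel launches so a fault is reported closer
    # to the launch that caused it.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    cmd = ["python"] + list(script_args)
    if use_sanitizer:
        # Assumption: compute-sanitizer is installed and on PATH.
        cmd = ["compute-sanitizer", "--tool", "memcheck"] + cmd
    return cmd, env


if __name__ == "__main__":
    cmd, env = build_debug_run(
        ["dcn.py", "--check", "solo", "--features_num", "91"]
    )
    # Print the command line instead of running it, so it can be reviewed first.
    print("CUDA_LAUNCH_BLOCKING=" + env["CUDA_LAUNCH_BLOCKING"], shlex.join(cmd))
```

The returned pair can be passed to `subprocess.run(cmd, env=env)`. Expect the run to be much slower than normal, especially under the sanitizer.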

@dusir dusir added the question Further information is requested label Sep 7, 2023
@zehuanw zehuanw changed the title [Question] H800上跑hugectr dcn测试,遇到踩内存的问题 [Question] An illegal memory access was encountered on H800 & Hugectr dcn test Sep 7, 2023
@JacoCheung
Collaborator

Hi @dusir, is this issue exactly the same as #414? Can I assume the script in #414 is what caused the error?

From the log, I suspect there may be a misconfiguration in either the ETC (embedding training cache) or the dataset.

@JacoCheung
Collaborator

JacoCheung commented Sep 8, 2023

I suggest doing the following step by step to narrow down the problem:

  1. If possible, can you try to reproduce this error on other platforms, say V100/A100?
  2. Turn off ETC to see if the problem is still there. (To turn off ETC, all source files must be consolidated into one.)
  3. (If step 2 does not solve the problem) Use repeat mode instead of epoch mode. (Setting Solver::repeat_dataset=True enables repeat mode; see the doc.)
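
For step 2, a minimal sketch of consolidating the sources. It assumes that `0.txt`, `1.txt`, and `2.txt` are HugeCTR-style file lists whose first line is the number of data files, followed by one path per line; the thread does not show their contents, so check this first. `merge_file_lists` is a hypothetical helper, not a HugeCTR API.

```python
from pathlib import Path


def merge_file_lists(list_paths, out_path):
    """Merge several file lists into one and return the merged entries.

    Assumed format of each list: the first non-empty line is the entry
    count, and every following non-empty line is one data file path.
    """
    merged = []
    for list_path in list_paths:
        text = Path(list_path).read_text()
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        if not lines:
            raise ValueError(f"{list_path}: empty file list")
        declared, entries = int(lines[0]), lines[1:]
        if declared != len(entries):
            raise ValueError(
                f"{list_path}: header says {declared}, found {len(entries)}"
            )
        merged.extend(entries)
    # Write the merged list in the same count-then-paths layout.
    Path(out_path).write_text("\n".join([str(len(merged))] + merged) + "\n")
    return merged
```

For example, `merge_file_lists(["0.txt", "1.txt", "2.txt"], "all.txt")` writes one list that can be used as the only training source. The sketch rewrites only the list itself and leaves any per-directory metadata the Parquet reader expects untouched. For step 3, `repeat_dataset` is an argument of `hugectr.CreateSolver` in the Python API.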

@dusir
Author

dusir commented Sep 8, 2023

I suggest doing following things step by step to narrow down the problem scope:

  1. If possible, can we run on other platforms to reproduce this error ? Say V100/A100
  2. Turn off etc to see if the problem is still there. (To turn off etc, all source files must be consolidated as one)
  3. (If in step 2, the problem is not solved) Use repeat mode instead of epoch mode (Setting the Solver::repeat_dataset=True will enable repeat mode, see doc )

Hi,
1. V100 is fine; the problem does not occur there.
2. With ETC turned off, there is no issue.
3. After setting the option Solver::repeat_dataset=True, we get an error:
[image]

@zehuanw zehuanw added bug It's a bug / potential bug and need verification stage::doing labels Sep 11, 2023
@viswa-nvidia viswa-nvidia added the P0 Must have label Sep 11, 2023
@JacoCheung JacoCheung self-assigned this Sep 18, 2023
@JacoCheung
Collaborator

Closing, as this has been fixed.
