
Assert in verify_correctness for mistral-7B #107

Open
abgoswam opened this issue Sep 1, 2024 · 3 comments

abgoswam commented Sep 1, 2024

I am following the getting-started guide with the Mistral-7B model.

  • I am able to (1) convert mistralai/Mistral-7B-v0.1 and (2) pre-process the data.
  • However, I hit an error when trying to run the verify_correctness.py script.

It seems to be an issue in building the dataset iterators before training. Any pointers?

Here are my steps:

1. Converted Mistral-7B-v0.1 (works)

python hf_to_megatron.py --size 7 --out out_mistral_7b --model-path mistralai/Mistral-7B-v0.1 --cache-dir cache_mistral_7b mistral

2. Data-preprocessing (works)

python tools/preprocess_data.py \
        --input=./my_long_corpus_mistral/my_long_corpus_4096.jsonl \
        --output_prefix=my_long_corpus_4096_mistral \
        --vocab_file=./weights_conversion/out_mistral_7b/tokenizer.model \
        --tokenizer_type=SentencePieceTokenizer \
        --workers=2 \
        --dataset_impl=mmap \
        --chunk_size=32
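
For reference, this step should emit a .bin/.idx pair under the _text_document suffix, which is what --data_path points to in step 3. A quick way to sanity-check what was written (a sketch only, assuming this fork keeps Megatron-LM's megatron.data.indexed_dataset.make_dataset API; the path matches the one I pass in step 3):

# Sketch: load the preprocessed dataset and report its size
# (assumption: make_dataset(path, impl, skip_warmup) exists as in upstream Megatron-LM).
from megatron.data.indexed_dataset import make_dataset

prefix = "./my_long_corpus_mistral/my_long_corpus_4096_mistral_text_document"
ds = make_dataset(prefix, "mmap", skip_warmup=True)
print("documents:", len(ds))            # number of tokenized JSONL records
print("tokens:", int(ds.sizes.sum()))   # total tokens across all documents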

3. Verify correctness of model conversion (FAILED)

DISTRIBUTED_ARGS="--nproc_per_node 1 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 8000"
LLAMA_ARGS="--use_rms_norm --glu_activation swiglu --no_tie_embed_logits --no_new_tokens --layernorm_epsilon 1e-5"
COMMON_ARGS="--hidden_dropout 0.0 --attention_dropout 0.0 --no_bias_gelu_fusion"

torchrun $DISTRIBUTED_ARGS verify_correctness.py \
	--model_name=mistral \
	--model_size=7 \
	--load=./weights_conversion/out_mistral_7b \
	--data_path=./my_long_corpus_mistral/my_long_corpus_4096_mistral_text_document \
	--tokenizer_type=SentencePieceTokenizer \
	--vocab_file=./weights_conversion/out_mistral_7b/tokenizer.model \
	--huggingface_cache=./weights_conversion/cache_mistral_7b/ \
	--huggingface_device=cuda:1 \
	$COMMON_ARGS $LLAMA_ARGS 

Assert:

I am hitting the following error:

 > loading shuffle-idx mapping from ./my_long_corpus_mistral/my_long_corpus_4096_mistral_text_document_valid_indexmap_100ns_32768sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.001 seconds
    total number of tokens: 1229100
    total number of samples: 113
    total number of epochs: 3
 > WARNING: could not find index map files, building the indices on rank 0 ...
[rank0]: Traceback (most recent call last):
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/verify_correctness.py", line 217, in <module>
[rank0]:     main()
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/verify_correctness.py", line 179, in main
[rank0]:     data_iterator, _, _ = build_train_valid_test_data_iterators(
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/training.py", line 911, in build_train_valid_test_data_iterators
[rank0]:     train_ds, valid_ds, test_ds = build_train_valid_test_datasets_provider(
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/finetune.py", line 179, in data_provider
[rank0]:     train_ds, valid_ds, test_ds = builder(
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/data/gpt_dataset.py", line 35, in build_train_valid_test_datasets
[rank0]:     return _build_train_valid_test_datasets(data_prefix[0],
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/data/gpt_dataset.py", line 201, in _build_train_valid_test_datasets
[rank0]:     test_dataset = _f(2, 'test')
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/data/gpt_dataset.py", line 193, in _f
[rank0]:     dataset = GPTDataset(name, data_prefix,
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/data/gpt_dataset.py", line 234, in __init__
[rank0]:     self.doc_idx, self.sample_idx, self.shuffle_idx = _build_index_mappings(
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/data/gpt_dataset.py", line 324, in _build_index_mappings
[rank0]:     assert last_epoch_num_samples < (num_samples_per_epoch + 1), \
[rank0]: AssertionError: last epoch number of samples exceeded max value.
E0901 18:13:33.273000 139940874703296 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 1963981) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0a0+f70bd71a48.nv24.6', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 900, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 891, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
verify_correctness.py FAILED
------------------------------------------------------------

abgoswam commented Sep 1, 2024

cc @martinjaggi


abgoswam commented Sep 1, 2024

I can also reproduce the same error with meta-llama/Llama-2-7b-hf.

DISTRIBUTED_ARGS="--nproc_per_node 1 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 8000"
LLAMA_ARGS="--use_rms_norm --glu_activation swiglu --no_tie_embed_logits --no_new_tokens --layernorm_epsilon 1e-5"
COMMON_ARGS="--hidden_dropout 0.0 --attention_dropout 0.0 --no_bias_gelu_fusion"

torchrun $DISTRIBUTED_ARGS verify_correctness.py \
	--model_name=llama2 \
	--model_size=7 \
	--load=./weights_conversion/out_llama2_7b \
	--data_path=./my_long_corpus_llama2/my_long_corpus_128_llama2_text_document \
	--tokenizer_type=SentencePieceTokenizer \
	--vocab_file=./weights_conversion/out_llama2_7b/tokenizer.model \
	--huggingface_cache=./weights_conversion/cache_llama2_7b/ \
	--huggingface_device=cuda:1 \
	$COMMON_ARGS $LLAMA_ARGS 

Pasting full error:

/home/aiscuser/.local/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
Setting num_layers to 32 from checkpoint
Setting hidden_size to 4096 from checkpoint
Setting ffn_hidden_size to 11008 from checkpoint
Setting seq_length to 4096 from checkpoint
Setting num_attention_heads to 32 from checkpoint
Setting max_position_embeddings to 4096 from checkpoint
Setting padded_vocab_size to 32000 from checkpoint
Setting position_embedding_type to rotary from checkpoint
Setting bias_droput_fusion to False from checkpoint
Setting parallel_attn to False from checkpoint
Setting use_rms_norm to True from checkpoint
Setting tie_embed_logits to False from checkpoint
Setting make_vocab_size_divisible_by to 128 from checkpoint
Setting tensor_model_parallel_size to 1 from checkpoint
Setting pipeline_model_parallel_size to 1 from checkpoint
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
WARNING: overriding default arguments for use_checkpoint_args:True                        with use_checkpoint_args:False
setting global batch size to 1
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_query_key_layer_scaling ................... True
  apply_residual_connection_post_layernorm ........ False
  async_tensor_model_parallel_allreduce ........... True
  attention_dropout ............................... 0.0
  attention_softmax_in_fp32 ....................... False
  barrier_with_L1_time ............................ True
  baseline_device ................................. cuda:1
  bert_load ....................................... None
  bf16 ............................................ False
  bias_dropout_fusion ............................. True
  bias_droput_fusion .............................. False
  bias_gelu_fusion ................................ False
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  cache_dir ....................................... weights_conversion/cache_llama2_7b
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  data_impl ....................................... infer
  data_parallel_random_init ....................... False
  data_parallel_size .............................. 1
  data_path ....................................... ['./my_long_corpus_llama2/my_long_corpus_128_llama2_text_document']
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  data_type ....................................... gpt
  dataloader_type ................................. single
  DDP_impl ........................................ local
  decoder_num_layers .............................. None
  decoder_seq_length .............................. None
  dino_bottleneck_size ............................ 256
  dino_freeze_last_layer .......................... 1
  dino_head_hidden_size ........................... 2048
  dino_local_crops_number ......................... 10
  dino_local_img_size ............................. 96
  dino_norm_last_layer ............................ False
  dino_teacher_temp ............................... 0.07
  dino_warmup_teacher_temp ........................ 0.04
  dino_warmup_teacher_temp_epochs ................. 30
  distribute_saved_activations .................... False
  distributed_backend ............................. nccl
  embedding_path .................................. None
  empty_unused_memory_level ....................... 0
  encoder_num_layers .............................. 32
  encoder_seq_length .............................. 4096
  end_weight_decay ................................ 0.01
  eod_mask_loss ................................... False
  eval_interval ................................... 1000
  eval_iters ...................................... 100
  eval_only ....................................... False
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_signal_handler ............................. False
  ffn_hidden_size ................................. 11008
  finetune ........................................ False
  fp16 ............................................ False
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  fp8_amax_compute_algo ........................... most_recent
  fp8_amax_history_len ............................ 1
  fp8_e4m3 ........................................ False
  fp8_hybrid ...................................... False
  fp8_interval .................................... 1
  fp8_margin ...................................... 0
  fp8_wgrad ....................................... True
  global_batch_size ............................... 1
  glu_activation .................................. swiglu
  gradient_accumulation_fusion .................... True
  head_lr_mult .................................... 1.0
  hidden_dropout .................................. 0.0
  hidden_size ..................................... 4096
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... 512
  init_method_std ................................. 0.02
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  iter_per_epoch .................................. 1250
  iteration ....................................... release
  kv_channels ..................................... 128
  layernorm_epsilon ............................... 1e-05
  lima_dropout .................................... False
  load ............................................ ./weights_conversion/out_llama2_7b
  load_iters ...................................... None
  local_rank ...................................... None
  log_batch_size_to_tensorboard ................... False
  log_interval .................................... 100
  log_learning_rate_to_tensorboard ................ True
  log_loss_scale_to_tensorboard ................... True
  log_memory_to_tensorboard ....................... False
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_timers_to_tensorboard ....................... False
  log_validation_ppl_to_tensorboard ............... False
  log_world_size_to_tensorboard ................... False
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 1.0
  lr_decay_iters .................................. None
  lr_decay_samples ................................ None
  lr_decay_style .................................. linear
  lr_warmup_fraction .............................. None
  lr_warmup_iters ................................. 0
  lr_warmup_samples ............................... 0
  make_vocab_size_divisible_by .................... 128
  mask_prob ....................................... 0.15
  masked_softmax_fusion ........................... True
  max_position_embeddings ......................... 4096
  max_tokens_to_oom ............................... 12000
  merge_file ...................................... None
  metrics ......................................... []
  micro_batch_size ................................ 1
  min_loss_scale .................................. 1.0
  min_lr .......................................... 0.0
  mmap_warmup ..................................... False
  model_name ...................................... llama2
  model_size ...................................... 7
  model_type ...................................... encoder_or_decoder
  new_tokens ...................................... False
  no_load_optim ................................... None
  no_load_rng ..................................... None
  no_persist_layer_norm ........................... False
  no_save_optim ................................... None
  no_save_rng ..................................... None
  num_attention_heads ............................. 32
  num_attention_heads_kv .......................... 32
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_layers ...................................... 32
  num_layers_per_virtual_pipeline_stage ........... None
  num_workers ..................................... 2
  onnx_safe ....................................... None
  optimizer ....................................... adam
  override_opt_param_scheduler .................... False
  padded_vocab_size ............................... 32000
  parallel_attn ................................... False
  parallel_layernorm .............................. False
  params_dtype .................................... torch.float32
  patch_dim ....................................... 16
  perform_initialization .......................... True
  pipeline_model_parallel_size .................... 1
  pipeline_model_parallel_split_rank .............. None
  position_embedding_type ......................... PositionEmbeddingType.rotary
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  recompute_granularity ........................... None
  recompute_method ................................ None
  recompute_num_layers ............................ 1
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  rope_scaling_factor ............................. 1.0
  rope_theta ...................................... 10000.0
  sample_rate ..................................... 1.0
  save ............................................ None
  save_interval ................................... None
  scalar_loss_mask ................................ 0.0
  scatter_gather_tensors_in_pipeline .............. True
  seed ............................................ 1234
  seq_length ...................................... 4096
  sequence_parallel ............................... False
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  skip_iters ...................................... []
  sliding_window_size ............................. None
  split ........................................... 969, 30, 1
  standalone_embedding_stage ...................... False
  start_weight_decay .............................. 0.01
  tensor_model_parallel_size ...................... 1
  tensorboard_dir ................................. None
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1000
  test_data_path .................................. None
  tie_embed_logits ................................ False
  timing_log_level ................................ 0
  timing_log_option ............................... minmax
  titles_data_path ................................ None
  tokenizer_model ................................. None
  tokenizer_type .................................. SentencePieceTokenizer
  train_data_path ................................. None
  train_iters ..................................... 10
  train_samples ................................... None
  transformer_impl ................................ local
  transformer_pipeline_model_parallel_size ........ 1
  use_bias ........................................ False
  use_checkpoint_args ............................. False
  use_checkpoint_opt_param_scheduler .............. False
  use_contiguous_buffers_in_local_ddp ............. True
  use_cpu_initialization .......................... None
  use_distributed_optimizer ....................... False
  use_flash_attn .................................. False
  use_one_sent_docs ............................... False
  use_post_ln ..................................... False
  use_ring_exchange_p2p ........................... False
  use_rms_norm .................................... True
  valid_data_path ................................. None
  variable_seq_lengths ............................ False
  virtual_pipeline_model_parallel_size ............ None
  vocab_extra_ids ................................. 0
  vocab_extra_ids_list ............................ None
  vocab_file ...................................... ./weights_conversion/out_llama2_7b/tokenizer.model
  wandb_api_key ................................... None
  wandb_entity .................................... meditron
  wandb_id ........................................ None
  wandb_logger .................................... False
  wandb_name ...................................... None
  wandb_project ................................... None
  wandb_resume .................................... allow
  weight_decay .................................... 0.01
  weight_decay_incr_style ......................... constant
  world_size ...................................... 1
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 1
> building SentencePieceTokenizer tokenizer ...
Special tokens: {'<s>': 1, '</s>': 2}
 > padded vocab (size: 32000) with 0 dummy tokens (new size: 32000)
> initializing torch distributed ...
> initialized tensor model parallel with size 1
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
Starting megatron vs huggingface verification
[rank0]:[W901 18:43:30.548352164 init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
Loading our model!
Getting megatron model...
Building model ...
/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/model/llama_model.py:36: UserWarning: Llama is not intended to use bias_dropout_fusion
  warnings.warn("Llama is not intended to use bias_dropout_fusion")
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 6738415616
/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/optimizer/optimizer.py:711: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:78.)
  self._scale = torch.cuda.FloatTensor([1.0])
> learning rate decay style: linear
node-0:2046259:2046259 [0] NCCL INFO Bootstrap : Using eth0:10.8.37.80<0>
node-0:2046259:2046259 [0] NCCL INFO cudaDriverVersion 12050
NCCL version 2.21.5+cuda12.5
node-0:2046259:2047044 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
node-0:2046259:2047044 [0] NCCL INFO P2P plugin IBext_v8
node-0:2046259:2047044 [0] NCCL INFO NET/IB : No device found.
node-0:2046259:2047044 [0] NCCL INFO NET/IB : No device found.
node-0:2046259:2047044 [0] NCCL INFO NET/Socket : Using [0]eth0:10.8.37.80<0>
node-0:2046259:2047044 [0] NCCL INFO Using non-device net plugin version 0
node-0:2046259:2047044 [0] NCCL INFO Using network Socket
node-0:2046259:2047044 [0] NCCL INFO ncclCommInitRank comm 0x55bd768d3e40 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 100000 commId 0xd549aec3ffbca01b - Init START
node-0:2046259:2047044 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/00000041-0001-0000-3130-444532304235/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
node-0:2046259:2047044 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/00000041-0001-0000-3130-444532304235/pci0001:00/0001:00:00.0/../max_link_width, ignoring
node-0:2046259:2047044 [0] NCCL INFO === System : maxBw 5000.0 totalBw 5000.0 ===
node-0:2046259:2047044 [0] NCCL INFO CPU/0-0 (1/2/-1)
node-0:2046259:2047044 [0] NCCL INFO + PCI[5000.0] - NIC/0-0
node-0:2046259:2047044 [0] NCCL INFO + PCI[24.0] - GPU/0-100000 (0)
node-0:2046259:2047044 [0] NCCL INFO ==========================================
node-0:2046259:2047044 [0] NCCL INFO GPU/100000 :GPU/0-100000 (0/5000.0/LOC) CPU/0-0 (1/24.0/PHB) 
node-0:2046259:2047044 [0] NCCL INFO Setting affinity for GPU 0 to ffffff
node-0:2046259:2047044 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 16, bw 40.000000/40.000000, type LOC/PIX, sameChannels 1
node-0:2046259:2047044 [0] NCCL INFO  0 : GPU/0
node-0:2046259:2047044 [0] NCCL INFO  1 : GPU/0
....
....
node-0:2046259:2047044 [0] NCCL INFO P2P Chunksize set to 131072
node-0:2046259:2047044 [0] NCCL INFO Connected all rings
node-0:2046259:2047044 [0] NCCL INFO Connected all trees
node-0:2046259:2047044 [0] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
node-0:2046259:2047044 [0] NCCL INFO ncclCommInitRank comm 0x55bd768d3e40 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 100000 commId 0xd549aec3ffbca01b - Init COMPLETE
 loading release checkpoint from ./weights_conversion/out_llama2_7b
 checkpoint version 3.0
  successfully loaded checkpoint from ./weights_conversion/out_llama2_7b at iteration 0
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:3199: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
(min, max) time across ranks (ms):
    load-checkpoint ................................: (5776.25, 5776.25)
Loading baseline model!
Getting huggingface model...

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:01<00:01,  1.47s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.16s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.21s/it]
Loading dataset!
> building train, validation, and test datasets ...
 > datasets target sizes (minimum size):
    train:      10
    validation: 100
    test:       100
> building train, validation, and test datasets ...
Single data path provided for train, valid & test
 > building dataset index ...
    reading sizes...
    reading pointers...
    reading document index...
    creating numpy buffer of mmap...
    creating memory view of numpy buffer...
 > finished creating indexed dataset in 0.000472 seconds
    number of documents: 10000
    number of tokens: 1280000
 > dataset split:
    train:
     document indices in [0, 9690) total of 9690 documents
    validation:
     document indices in [9690, 9990) total of 300 documents
    test:
     document indices in [9990, 10000) total of 10 documents
node-0:2046259:2049109 [0] NCCL INFO Using non-device net plugin version 0
node-0:2046259:2049109 [0] NCCL INFO Using network Socket
node-0:2046259:2049109 [0] NCCL INFO bootstrapSplit: comm 0x55bd7bb635c0 parent 0x55bd768d3e40 rank 0 nranks 1 color -1091263299 key 0 prev 0 next 0 - DONE
node-0:2046259:2049109 [0] NCCL INFO ncclCommSplit comm 0x55bd7bb635c0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 100000 parent 0x55bd768d3e40 color -1091263299 key 0 commId 0xa7fa1bf3e8499106 - Init START
node-0:2046259:2049109 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/00000041-0001-0000-3130-444532304235/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
node-0:2046259:2049109 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/00000041-0001-0000-3130-444532304235/pci0001:00/0001:00:00.0/../max_link_width, ignoring
node-0:2046259:2049109 [0] NCCL INFO === System : maxBw 5000.0 totalBw 5000.0 ===
node-0:2046259:2049109 [0] NCCL INFO CPU/0-0 (1/2/-1)
node-0:2046259:2049109 [0] NCCL INFO + PCI[5000.0] - NIC/0-0
node-0:2046259:2049109 [0] NCCL INFO + PCI[24.0] - GPU/0-100000 (0)
node-0:2046259:2049109 [0] NCCL INFO ==========================================
node-0:2046259:2049109 [0] NCCL INFO GPU/100000 :GPU/0-100000 (0/5000.0/LOC) CPU/0-0 (1/24.0/PHB) 
node-0:2046259:2049109 [0] NCCL INFO Setting affinity for GPU 0 to ffffff
node-0:2046259:2049109 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 16, bw 40.000000/40.000000, type LOC/PIX, sameChannels 1
...
...
node-0:2046259:2049109 [0] NCCL INFO P2P Chunksize set to 131072
node-0:2046259:2049109 [0] NCCL INFO Connected all rings
node-0:2046259:2049109 [0] NCCL INFO Connected all trees
node-0:2046259:2049109 [0] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
node-0:2046259:2049109 [0] NCCL INFO ncclCommSplit comm 0x55bd7bb635c0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 100000 parent 0x55bd768d3e40 color -1091263299 key 0 commId 0xa7fa1bf3e8499106 - Init COMPLETE
node-0:2046259:2049114 [0] NCCL INFO Using non-device net plugin version 0
node-0:2046259:2049114 [0] NCCL INFO Using network Socket
node-0:2046259:2049114 [0] NCCL INFO bootstrapSplit: comm 0x55bd7bb6bec0 parent 0x55bd768d3e40 rank 0 nranks 1 color -1091263299 key 0 prev 0 next 0 - DONE
node-0:2046259:2049114 [0] NCCL INFO ncclCommSplit comm 0x55bd7bb6bec0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 100000 parent 0x55bd768d3e40 color -1091263299 key 0 commId 0xa7fa1bf3e8499106 - Init START
node-0:2046259:2049114 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/00000041-0001-0000-3130-444532304235/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
node-0:2046259:2049114 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/00000041-0001-0000-3130-444532304235/pci0001:00/0001:00:00.0/../max_link_width, ignoring
node-0:2046259:2049114 [0] NCCL INFO === System : maxBw 5000.0 totalBw 5000.0 ===
node-0:2046259:2049114 [0] NCCL INFO CPU/0-0 (1/2/-1)
node-0:2046259:2049114 [0] NCCL INFO + PCI[5000.0] - NIC/0-0
node-0:2046259:2049114 [0] NCCL INFO + PCI[24.0] - GPU/0-100000 (0)
node-0:2046259:2049114 [0] NCCL INFO ==========================================
node-0:2046259:2049114 [0] NCCL INFO GPU/100000 :GPU/0-100000 (0/5000.0/LOC) CPU/0-0 (1/24.0/PHB) 
node-0:2046259:2049114 [0] NCCL INFO Setting affinity for GPU 0 to ffffff
node-0:2046259:2049114 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 16, bw 40.000000/40.000000, type LOC/PIX, sameChannels 1
...
...
node-0:2046259:2049114 [0] NCCL INFO P2P Chunksize set to 131072
node-0:2046259:2049114 [0] NCCL INFO Connected all rings
node-0:2046259:2049114 [0] NCCL INFO Connected all trees
node-0:2046259:2049114 [0] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
node-0:2046259:2049114 [0] NCCL INFO ncclCommSplit comm 0x55bd7bb6bec0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 100000 parent 0x55bd768d3e40 color -1091263299 key 0 commId 0xa7fa1bf3e8499106 - Init COMPLETE
 > loading doc-idx mapping from ./my_long_corpus_llama2/my_long_corpus_128_llama2_text_document_train_indexmap_10ns_4096sl_1234s_doc_idx.npy
 > loading sample-idx mapping from ./my_long_corpus_llama2/my_long_corpus_128_llama2_text_document_train_indexmap_10ns_4096sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from ./my_long_corpus_llama2/my_long_corpus_128_llama2_text_document_train_indexmap_10ns_4096sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.001 seconds
    total number of tokens: 1240320
    total number of samples: 303
    total number of epochs: 1
 > loading doc-idx mapping from ./my_long_corpus_llama2/my_long_corpus_128_llama2_text_document_valid_indexmap_100ns_4096sl_1234s_doc_idx.npy
 > loading sample-idx mapping from ./my_long_corpus_llama2/my_long_corpus_128_llama2_text_document_valid_indexmap_100ns_4096sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from ./my_long_corpus_llama2/my_long_corpus_128_llama2_text_document_valid_indexmap_100ns_4096sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.001 seconds
    total number of tokens: 38400
    total number of samples: 104
    total number of epochs: 11
 > WARNING: could not find index map files, building the indices on rank 0 ...
[rank0]: Traceback (most recent call last):
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/verify_correctness.py", line 217, in <module>
[rank0]:     main()
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/verify_correctness.py", line 179, in main
[rank0]:     data_iterator, _, _ = build_train_valid_test_data_iterators(
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/training.py", line 911, in build_train_valid_test_data_iterators
[rank0]:     train_ds, valid_ds, test_ds = build_train_valid_test_datasets_provider(
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/finetune.py", line 179, in data_provider
[rank0]:     train_ds, valid_ds, test_ds = builder(
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/data/gpt_dataset.py", line 35, in build_train_valid_test_datasets
[rank0]:     return _build_train_valid_test_datasets(data_prefix[0],
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/data/gpt_dataset.py", line 201, in _build_train_valid_test_datasets
[rank0]:     test_dataset = _f(2, 'test')
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/data/gpt_dataset.py", line 193, in _f
[rank0]:     dataset = GPTDataset(name, data_prefix,
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/data/gpt_dataset.py", line 234, in __init__
[rank0]:     self.doc_idx, self.sample_idx, self.shuffle_idx = _build_index_mappings(
[rank0]:   File "/tmp/amlt-code-download/pfLLM-Megatron-LLM/megatron/data/gpt_dataset.py", line 324, in _build_index_mappings
[rank0]:     assert last_epoch_num_samples < (num_samples_per_epoch + 1), \
[rank0]: AssertionError: last epoch number of samples exceeded max value.
E0901 18:44:13.294000 140017019376064 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 2046259) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0a0+f70bd71a48.nv24.6', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 900, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 891, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
verify_correctness.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-01_18:44:13
  host      : node-0
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2046259)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================


abgoswam commented Sep 2, 2024

It seems the error happens when building the index for the "test" split.

Setting the test split to 0 gets me past the assert:

--split 950,50,0

With this, the verification for llama2 works fine.
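
For reference, here is the arithmetic behind the assert as I understand it (a minimal sketch, assuming megatron/data/gpt_dataset.py follows upstream Megatron-LM's _build_index_mappings; the numbers come from the llama2 log above, where the test split gets 10 documents of ~128 tokens, seq_length is 4096, and 100 test samples are requested):

# Sketch only: mirrors my reading of _build_index_mappings in upstream Megatron-LM.
seq_length = 4096
tokens_per_epoch = 10 * 128   # test split: 10 documents of ~128 tokens each
num_samples = 100             # requested test samples (eval_iters * global_batch_size)

# number of epochs needed to cover num_samples
num_epochs, total_tokens = 0, 0
while (total_tokens - 1) // seq_length < num_samples:
    num_epochs += 1
    total_tokens += tokens_per_epoch

num_samples_per_epoch = (tokens_per_epoch - 1) // seq_length             # 0: split shorter than one sequence
samples_from_prior_epochs = ((num_epochs - 1) * tokens_per_epoch - 1) // seq_length
last_epoch_num_samples = num_samples - samples_from_prior_epochs         # 1

# The failing check: 1 < (0 + 1) is False, hence the AssertionError.
print(last_epoch_num_samples < (num_samples_per_epoch + 1))

Because the 10-document test split holds fewer tokens than a single 4096-token sequence, num_samples_per_epoch comes out to 0 and the check cannot pass. Zeroing the test split skips building that index entirely; presumably giving the test split at least seq_length tokens' worth of documents would avoid the assert as well.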
