
TypeError: IterableDataset.map() got an unexpected keyword argument 'num_proc' with streaming datasets #1741

Closed
mrbesher opened this issue Jun 15, 2024 · 3 comments · Fixed by #1899


mrbesher commented Jun 15, 2024

I encountered a TypeError when using streaming datasets: SFTTrainer passes num_proc to dataset.map(), but IterableDataset.map() does not accept a num_proc argument.
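
The mismatch can be reproduced outside trl. A minimal sketch, assuming datasets 2.20.0 (Dataset.map() accepts num_proc, IterableDataset.map() does not):

from datasets import Dataset

ds = Dataset.from_dict({"Text": ["to be", "or not to be"]})
ds.map(lambda x: x, num_proc=2)  # fine: Dataset.map() supports num_proc

ids = ds.to_iterable_dataset()  # streaming-style view of the same data
ids.map(lambda x: x, num_proc=2)  # TypeError: unexpected keyword argument 'num_proc'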

Error logs:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[9], line 12
      5 dataset = load_dataset("Trelis/tiny-shakespeare", streaming=True)
      7 sft_config = SFTConfig(output_dir="output",
      8                        report_to="none",
      9                        dataset_text_field="Text",
     10                        max_seq_length=8)
---> 12 trainer = SFTTrainer(
     13         model_id,
     14         args=sft_config,
     15         train_dataset=dataset["train"],
     16         eval_dataset=dataset["test"]
     17     )
     18 trainer.train()

File /opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py:101, in _deprecate_arguments.<locals>._inner_deprecate_positional_args.<locals>.inner_f(*args, **kwargs)
     99         message += "\n\n" + custom_message
    100     warnings.warn(message, FutureWarning)
--> 101 return f(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:362, in SFTTrainer.__init__(self, model, args, data_collator, train_dataset, eval_dataset, tokenizer, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics, peft_config, dataset_text_field, packing, formatting_func, max_seq_length, infinite, num_of_sequences, chars_per_token, dataset_num_proc, dataset_batch_size, neftune_noise_alpha, model_init_kwargs, dataset_kwargs, eval_packing)
    360     args.dataset_kwargs = {}
    361 if train_dataset is not None:
--> 362     train_dataset = self._prepare_dataset(
    363         train_dataset,
    364         tokenizer,
    365         args.packing,
    366         args.dataset_text_field,
    367         args.max_seq_length,
    368         formatting_func,
    369         args.num_of_sequences,
    370         args.chars_per_token,
    371         remove_unused_columns=args.remove_unused_columns if args is not None else True,
    372         **args.dataset_kwargs,
    373     )
    374 if eval_dataset is not None:
    375     _multiple = isinstance(eval_dataset, dict)

File /opt/conda/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:508, in SFTTrainer._prepare_dataset(self, dataset, tokenizer, packing, dataset_text_field, max_seq_length, formatting_func, num_of_sequences, chars_per_token, remove_unused_columns, append_concat_token, add_special_tokens, skip_prepare_dataset)
    505     return dataset
    507 if not packing:
--> 508     return self._prepare_non_packed_dataloader(
    509         tokenizer,
    510         dataset,
    511         dataset_text_field,
    512         max_seq_length,
    513         formatting_func,
    514         add_special_tokens,
    515         remove_unused_columns,
    516     )
    518 else:
    519     return self._prepare_packed_dataloader(
    520         tokenizer,
    521         dataset,
   (...)
    528         add_special_tokens,
    529     )

File /opt/conda/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:576, in SFTTrainer._prepare_non_packed_dataloader(self, tokenizer, dataset, dataset_text_field, max_seq_length, formatting_func, add_special_tokens, remove_unused_columns)
    570 if not remove_unused_columns and len(extra_columns) > 0:
    571     warnings.warn(
    572         "You passed `remove_unused_columns=False` on a non-packed dataset. This might create some issues with the default collator and yield to errors. If you want to "
    573         f"inspect dataset other columns (in this case {extra_columns}), you can subclass `DataCollatorForLanguageModeling` in case you used the default collator and create your own data collator in order to inspect the unused dataset columns."
    574     )
--> 576 tokenized_dataset = dataset.map(
    577     tokenize,
    578     batched=True,
    579     remove_columns=dataset.column_names if remove_unused_columns else None,
    580     num_proc=self.dataset_num_proc,
    581     batch_size=self.dataset_batch_size,
    582 )
    584 return tokenized_dataset

TypeError: IterableDataset.map() got an unexpected keyword argument 'num_proc'

Reproduction Steps:

  1. Install trl, transformers, accelerate, bitsandbytes, and datasets using the following versions:
trl==0.9.4
transformers==4.41.2
accelerate==0.31.0
bitsandbytes==0.43.1
datasets==2.20.0
  2. Run the following code:
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

model_id = "/kaggle/working/toyllama"
dataset = load_dataset("Trelis/tiny-shakespeare", streaming=True)

sft_config = SFTConfig(output_dir="output",
                       report_to="none",
                       dataset_text_field="Text",
                       max_seq_length=8,
                       max_steps=10)

trainer = SFTTrainer(
        model_id,
        args=sft_config,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"]
    )
trainer.train()

Environment (probably not relevant):

  • Accelerate version: 0.31.0
  • Platform: Linux-5.15.133+-x86_64-with-glibc2.31
  • Python version: 3.10.13
  • Numpy version: 1.26.4
  • PyTorch version (GPU?): 2.1.2 (True)
  • PyTorch XPU available: False
  • PyTorch NPU available: False
  • PyTorch MLU available: False
  • System RAM: 31.36 GB
  • GPU type: Tesla T4
@maliozer (Contributor)

Same issue here. Is there any solution for this?
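
One possible workaround until a fix lands: tokenize the streaming dataset yourself and ask SFTTrainer to skip its own preparation step, which is what issues the failing map(num_proc=...) call. This is an untested sketch; skip_prepare_dataset is taken from the _prepare_dataset signature in the traceback above, and the tokenize function is illustrative:

from datasets import load_dataset
from transformers import AutoTokenizer
from trl import SFTTrainer, SFTConfig

model_id = "/kaggle/working/toyllama"
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset("Trelis/tiny-shakespeare", streaming=True)

def tokenize(batch):
    # Pre-tokenize so SFTTrainer never calls .map(num_proc=...) itself
    return tokenizer(batch["Text"], truncation=True, max_length=8)

train_dataset = dataset["train"].map(tokenize, batched=True, remove_columns=["Text"])

sft_config = SFTConfig(
    output_dir="output",
    report_to="none",
    max_seq_length=8,
    max_steps=10,
    dataset_kwargs={"skip_prepare_dataset": True},  # bypass _prepare_dataset entirely
)

trainer = SFTTrainer(model_id, args=sft_config, train_dataset=train_dataset)
trainer.train()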

@younesbelkada


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

qgallouedec reopened this Aug 5, 2024
@qgallouedec (Member)

Hey, thanks for reporting. I've implemented a fix in #1899.
Contributions are welcome to propagate this change to the other trainers.
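
For context, a guard of roughly this shape inside _prepare_non_packed_dataloader is what such a fix amounts to (an illustrative sketch only, not the merged code from #1899):

import datasets

map_kwargs = {
    "batched": True,
    "batch_size": self.dataset_batch_size,
    "remove_columns": dataset.column_names if remove_unused_columns else None,
}
if isinstance(dataset, datasets.Dataset):
    # Only map-style Dataset.map() accepts num_proc;
    # IterableDataset.map() does not, so omit it when streaming.
    map_kwargs["num_proc"] = self.dataset_num_proc
tokenized_dataset = dataset.map(tokenize, **map_kwargs)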
