
Is training from scratch possible now? #1283

Closed
Stamenov opened this issue Sep 18, 2019 · 9 comments

@Stamenov

Do the models support training from scratch, together with original (paper) parameters?

@Zacharias030

Zacharias030 commented Sep 18, 2019

You can just instantiate the models without .from_pretrained(), like so:

config = BertConfig()  # optionally pass your favorite parameters
model = BertForPreTraining(config)

I added a flag to run_lm_finetuning.py that gets checked in main(). Maybe this snippet helps (note: I am only using this with BERT, without next-sentence prediction).

# check whether to initialize the model freshly instead of loading a pretrained checkpoint
if args.do_fresh_init:
    config = config_class()
    tokenizer = tokenizer_class()
    if args.block_size <= 0:
        args.block_size = tokenizer.max_len  # Our input block size will be the max possible for the model
    args.block_size = min(args.block_size, tokenizer.max_len)
    model = model_class(config=config)
else:
    config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
    tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path)
    if args.block_size <= 0:
        args.block_size = tokenizer.max_len  # Our input block size will be the max possible for the model
    args.block_size = min(args.block_size, tokenizer.max_len)
    model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config)
model.to(args.device)
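
On the original question about the paper parameters: a fresh config can spell them out explicitly. A minimal sketch (assuming a recent transformers version where BertConfig accepts these keyword arguments; the values are the published BERT-base settings, with BERT-large noted in comments):

# Sketch: explicit BERT-base hyperparameters from the paper (BERT-large values in comments).
from transformers import BertConfig, BertForPreTraining

config = BertConfig(
    vocab_size=30522,             # WordPiece vocabulary size of the released BERT models
    hidden_size=768,              # 1024 for BERT-large
    num_hidden_layers=12,         # 24 for BERT-large
    num_attention_heads=12,       # 16 for BERT-large
    intermediate_size=3072,       # 4096 for BERT-large
    max_position_embeddings=512,
)
model = BertForPreTraining(config)   # weights are randomly initialized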

@Stamenov
Author

Hi,

thanks for the quick response.
I am more interested in the XLNet and TransformerXL models. Would they have the same interface?

@Zacharias030

Zacharias030 commented Sep 18, 2019 via email

@gooofy

gooofy commented Sep 21, 2019

I think XLNet requires a very specific training procedure, see #943 👍

"For XLNet, the implementation in this repo is missing some key functionality (the permutation generation function and an analogue of the dataset record generator) which you'd have to implement yourself."

@p-stefanov

#1283 (comment)

Hmm, tokenizers' constructors require a vocab_file parameter...
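
One way around that when initializing from scratch is to pass your own vocabulary file to the tokenizer constructor instead of calling .from_pretrained(); a minimal sketch (the path is a placeholder):

# Sketch: construct the tokenizer directly from a vocabulary file you provide.
from transformers import BertTokenizer

tokenizer = BertTokenizer(vocab_file="path/to/your/vocab.txt")  # placeholder path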

@stale

stale bot commented Nov 21, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Nov 21, 2019
@stale stale bot closed this as completed Nov 28, 2019
@jbmaxwell

@Stamenov Did you figure out how to pretrain XLNet? I'm interested in that as well.

@Stamenov
Author

No, I haven't. According to a recent tweet, Hugging Face may prioritize putting more effort into providing interfaces for pre-training from scratch.

@julien-c
Member

julien-c commented Feb 14, 2020

You can now leave --model_name_or_path unset (i.e. None) in run_language_modeling.py to train a model from scratch.

See also https://huggingface.co/blog/how-to-train
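
Condensed from that blog post, the from-scratch setup looks roughly like this (a small RoBERTa-like model; the hyperparameters below are the post's illustrative ones, not requirements):

# Rough sketch following the "how to train" blog post: fresh config, fresh model.
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=52000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config)  # randomly initialized, trained with the masked-LM objective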

julien-c added a commit to huggingface/blog that referenced this issue Feb 14, 2020