feat: Long Text Fine-Tuning Support #5532
base: main
_register_template glm4_long
# TODO: long-eos task, need your default_system on the existing dataset property 'system' keys
# build inputs with format `<bos> X Y <eos>`
# but labels with format ` Y `, not `<eos>`, for the long-eos task
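The input/label layout described in those comments can be sketched as follows. This is only an illustration under assumptions: `build_long_eos_example` is a hypothetical helper that is not part of this PR, the token ids are made up, and `-100` is used as the ignore value because PyTorch's cross-entropy loss skips that label by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def build_long_eos_example(bos_id, eos_id, source_ids, target_ids):
    """Hypothetical helper: inputs are `<bos> X Y <eos>`, labels supervise only `Y`."""
    input_ids = [bos_id] + source_ids + target_ids + [eos_id]
    # Mask <bos> and the prompt X, keep Y, and mask the trailing <eos> as well.
    labels = [IGNORE_INDEX] * (1 + len(source_ids)) + target_ids + [IGNORE_INDEX]
    return input_ids, labels


# Made-up ids: bos=1, eos=2, X=[10, 11], Y=[20, 21, 22]
input_ids, labels = build_long_eos_example(1, 2, [10, 11], [20, 21, 22])
```

Inputs and labels stay the same length, so the pair can be fed to a standard causal-LM loss without further alignment.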
- Implemented the `pack_data_preprocess` parameter to control input handling during training.
- When set to `True`, it disables `cutoff_len` for truncating inputs, raising an error if the input exceeds the specified length.
- Updated the frontend to reflect changes in the parameter's behavior.
- Completed training of the full `longwriter-glm4-9b` model with the new configuration.
- Included testing with the specified dataset to validate the implementation.
We are still verifying that the distribution repository does not have relevant code. Use the files in this compressed package to overwrite
logger.warning(f"""cutoff_len {cutoff_len} is too small for the input turn_idx: {turn_idx}, drop it.
eg: The eos_indice is exactly one less than the bubble length, causing the last one to be discarded.
""")
break
L59 raises an exception.
Curious why L66 just breaks?
When pack_data_preprocess is true, cutoff_len is not used for truncating the input. The check is:
pack_data_preprocess and len(source_ids)+len(target_ids) >= cutoff_len:
Used for verifying the maximum packing of long texts. For example, when the message length is >= 21, it should report an error instead of discarding the data if it doesn't form a complete training pack.
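A minimal sketch of the behavior described above, assuming the condition quoted in this reply is the whole check. `check_turn_length` is a hypothetical name used for illustration, not the function in the PR, and `cutoff_len=21` is chosen only to mirror the example length mentioned above.

```python
def check_turn_length(source_ids, target_ids, cutoff_len, pack_data_preprocess):
    """Hypothetical helper: return True if the turn fits, False if the caller should drop it."""
    total = len(source_ids) + len(target_ids)
    if total < cutoff_len:
        return True
    if pack_data_preprocess:
        # Strict mode: never truncate or drop silently, surface the problem instead.
        raise ValueError(
            f"input length {total} is not below cutoff_len {cutoff_len}; "
            "refusing to truncate because pack_data_preprocess is enabled"
        )
    # Default mode: the caller falls back to truncating or dropping the turn.
    return False


fits = check_turn_length([0] * 10, [0] * 10, cutoff_len=21, pack_data_preprocess=True)
```

With `pack_data_preprocess=False`, the same over-long input returns `False` and the caller can keep its warn-and-`break` path.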
Cases where an error should be reported:
preprocess_packed_supervised_dataset receives batched data from dataset.map. When the number of processing threads is 1, only one process handles the data. The graph is too abstract; normally, it would be divided into batch_size pieces for all processes to handle.
dataset = dataset.map(
preprocess_func,
batched=True,
batch_size=data_args.preprocessing_batch_size,
remove_columns=column_names,
**kwargs,
)
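To make the batching behavior concrete, here is a pure-Python emulation of a batched map. It is a sketch only, not the `datasets` implementation: `map_batched` and `pack_batch` are hypothetical names, and the toy packing simply concatenates every row in a batch.

```python
def map_batched(examples, preprocess_func, batch_size):
    """Pure-Python emulation of a batched map; not the `datasets` implementation."""
    outputs = []
    for start in range(0, len(examples), batch_size):
        chunk = examples[start:start + batch_size]
        # A batched function receives a dict of columns, each a list of values.
        batch = {key: [ex[key] for ex in chunk] for key in chunk[0]}
        result = preprocess_func(batch)
        # The returned columns may have a different number of rows than the input.
        n_rows = len(next(iter(result.values())))
        outputs.extend({k: v[i] for k, v in result.items()} for i in range(n_rows))
    return outputs


def pack_batch(batch):
    """Toy packing: concatenate all token lists in the batch into one row."""
    packed = [tok for ids in batch["input_ids"] for tok in ids]
    return {"input_ids": [packed]}


rows = [{"input_ids": [i, i]} for i in range(5)]
packed_rows = map_batched(rows, pack_batch, batch_size=2)
```

Each call to the preprocessing function sees only its own batch, so sequences in different batches are never packed together. This is why the batch size affects how fully long texts can be packed.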
src/llamafactory/data/template.py
Outdated
format_observation=StringFormatter(slots=["<|observation|>\n{{content}}<|assistant|>"]),
format_tools=ToolFormatter(tool_format="glm4"),
stop_words=["<|user|>", "<|observation|>"],
# default_system= "",
Better to remove the commented-out code.
okay
feat: Implement pack_data_preprocess parameter and integrate with frontend
- Added the `pack_data_preprocess` parameter to control input handling during training.
- When set to `True`, it disables the use of `cutoff_len` for truncating input, raising an error if the input exceeds the specified length.
- Updated frontend to reflect the changes in the parameter's behavior.
- Completed training of the full `longwriter-glm4-9b` model with the new configuration.
- Included testing with the specified dataset to validate the implementation.