update to support agent chat sft. #572

Open
wants to merge 2 commits into
base: main

lilongxian

While training GLM-4 on a multi-turn agent dialogue task, I found two bugs in the finetune.py script:

  1. In finetune_demo/finetune.py:
    The following code in the process_batch function is buggy:
        for message in conv:
            message = process_message(message)
            loss_mask_val = False if message['role'] in ('system', 'user', 'observation') else True
            new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[2:]
    and so is the following code in the process_batch_eval function:
        for message in conv:
            if len(input_ids) >= max_input_length:
                break
            else:
                message = process_message(message)
                new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[2:]
    Tracing into the code at the end of the apply_chat_template function:
        if tokenize:
            output = self.batch_encode_plus(
                [result] if isinstance(result[0], int) else result,
                padding=padding,
                truncation=truncation,
                max_length=max_length,
                return_tensors=return_tensors,
                is_split_into_words=True,
                add_special_tokens=False
            )
            if return_dict:
                return output
            else:
                return output["input_ids"]
    we can see that:
    tokenizer.apply_chat_template([message], tokenize=True, return_dict=False) returns a matrix containing a single vector, that is, a list holding one list of token ids. Applying the [2:] slice directly to that result discards the training data entirely, so the model cannot be trained effectively. What we actually want is the encoded vector of the current message with the leading [gMASK] removed, so the fix is to change:
        new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[2:]
    to:
        new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[0][2:]

With this change, the multi-turn dialogue model trains normally.
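The effect of the missing [0] index can be reproduced without loading a tokenizer. The sketch below stands in for the batched return value with a plain nested list; the token ids are made-up placeholders, not real GLM-4 vocabulary ids.

```python
# Stand-in for tokenizer.apply_chat_template([message], tokenize=True, return_dict=False):
# a batch (outer list) holding one encoded sequence (inner list).
# The ids below are hypothetical placeholders, not real GLM-4 token ids.
batched_ids = [[9001, 9002, 101, 102, 103]]

# Slicing the batch itself skips the first two *sequences*; with only one
# sequence in the batch, nothing is left.
buggy = batched_ids[2:]

# Select the single sequence first, then drop its two leading prefix tokens.
fixed = batched_ids[0][2:]

print(buggy)  # []
print(fixed)  # [101, 102, 103]
```

Because the empty list is still a valid value for input_ids += new_input_ids, the buggy version raises no error; it simply contributes no tokens.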

  2. The current code in finetune_demo/finetune.py does not support agent chat training data. The problem is as follows.
    From this function in finetune_demo/finetune.py:
        def process_message(message):
            if 'tools' in message and message['role'] == 'system':
                for tool in message['tools']:
                    parameters = tool['function']['parameters']['properties']
                    tool['function']['parameters']['properties'] = \
                        {k: v for k, v in parameters.items() if
                         v is not None}
            elif 'tools' in message:
                del message['tools']

    and from this part of the process_batch function:
        else:
            for message in conv:
                message = process_message(message)
                loss_mask_val = False if message['role'] in ('system', 'user', 'observation') else True
                new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[0][2:]
                input_ids += new_input_ids
                loss_masks += [loss_mask_val] * len(new_input_ids)

we can conclude that:
    The existing code does not model or convert the tool-call data that the assistant generates in agent chat training. Part of the actual training data looks like this:
      {
        "role": "user",
        "content": "Hi, I am looking for some book recommendations. I am interested in history and science fiction."
      },
      {
        "role": "assistant",
        "content": "{\"name\": \"get_recommended_books\", \"arguments\": {\"interests\": [\"history\", \"science fiction\"]}}"
      },
      {
        "role": "observation",
        "content": "{\"books\": [\"Sapiens: A Brief History of Humankind by Yuval Noah Harari\", \"A Brief History of Time by Stephen Hawking\", \"Dune by Frank Herbert\", \"The Martian by Andy Weir\"]}"
      },
      {
        "role": "assistant",
        "content": "Based on your interests in history and science fiction, I would recommend the following books: \"Sapiens: A Brief History of Humankind\" by Yuval Noah Harari, \"A Brief History of Time\" by Stephen Hawking, \"Dune\" by Frank Herbert, and \"The Martian\" by Andy Weir."
      }

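Here, the tool-call message generated by the assistant is:
      {
        "role": "assistant",
        "content": "{\"name\": \"get_recommended_books\", \"arguments\": {\"interests\": [\"history\", \"science fiction\"]}}"
      },

Inference tests with GLM-4 agent chat show that the API task-planning (tool call) output generated by GLM-4 has the format: functionName\n{"param":"p"}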
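To make that output format concrete, here is a minimal sketch that splits such a string into a function name and an argument dictionary. The helper name parse_tool_call is hypothetical and not part of GLM-4 or finetune.py; it assumes the first line is the function name and the remainder is valid JSON.

```python
import json


def parse_tool_call(generated: str):
    """Split 'functionName<newline>{json}' into (name, arguments).

    Hypothetical helper for illustration only.
    """
    # Everything before the first newline is the function name,
    # everything after it is the JSON-encoded argument object.
    name, _, raw_args = generated.partition("\n")
    return name.strip(), json.loads(raw_args)


name, arguments = parse_tool_call(
    'get_recommended_books\n{"interests": ["history", "science fiction"]}'
)
print(name)       # get_recommended_books
print(arguments)  # {'interests': ['history', 'science fiction']}
```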

Tracing the data-modeling logic further into the apply_chat_template function:
    input = self.build_single_message(
        item["role"],
        item.get("metadata", ""),
        item["content"],
        tokenize=tokenize
    )
and the function:
    def build_single_message(self, role, metadata, message, tokenize=True):
        """Tokens of <|{role}|>{metadata}\n{message}."""
        assert role in ["system", "user", "assistant", "observation"], role
        if tokenize:
            role_tokens = [self.convert_tokens_to_ids(f"<|{role}|>")] + self.tokenizer.encode(f"{metadata}\n",
                                                                                              disallowed_special=())
            message_tokens = self.tokenizer.encode(message, disallowed_special=())
            tokens = role_tokens + message_tokens
            return tokens
        else:
            return str(f"<|{role}|>{metadata}\n{message}")

From this we can infer that the GLM-4 agent chat training data template is:
[gMASK]<|system|>\n tools messages<|user|>\nQ<|assistant|>\nPrefix response<|assistant|>functionName\n{"param":"p"}<|observation|>\n tool return<|assistant|>\nAnswer
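The template can be checked by replaying the tokenize=False branch of build_single_message on a small conversation. This is only a sketch: render_message is a hypothetical stand-in that copies the format string quoted above, the conversation content is invented for illustration, and the leading prefix tokens are left out.

```python
def render_message(role, metadata, content):
    # Same format string as the tokenize=False branch of build_single_message.
    assert role in ["system", "user", "assistant", "observation"], role
    return f"<|{role}|>{metadata}\n{content}"


# Illustrative conversation; the tool call carries the function name in "metadata".
conversation = [
    {"role": "user", "content": "Q"},
    {"role": "assistant", "metadata": "get_recommended_books",
     "content": '{"interests": ["history"]}'},
    {"role": "observation", "content": '{"books": ["Dune"]}'},
    {"role": "assistant", "content": "Answer"},
]

# Messages without "metadata" fall back to "", as in item.get("metadata", "").
rendered = "".join(
    render_message(m["role"], m.get("metadata", ""), m["content"])
    for m in conversation
)
print(repr(rendered))
```

A message with empty metadata renders as the role token followed directly by a newline, while a tool call renders as the role token, the function name, a newline, and then the arguments.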

The above shows that, for agent chat training, the assistant's tool-call message
{
  "role": "assistant",
  "content": "{\"name\": \"get_recommended_books\", \"arguments\": {\"interests\": [\"history\", \"science fiction\"]}}"
}
must be converted to the format:
{
  "role": role,
  "metadata": function name,
  "content": the arguments {} object
}

Solution:

 To keep the code tidy, append the following to the process_message function in finetune.py:
# Convert agent chat training data.
if message['role'] == 'assistant':
    content = message['content']
    if isinstance(content, str) and content.startswith("{") and content.endswith("}"):
        try:
            content_ = eval(content)
            if isinstance(content_, dict) and "name" in content_ and "arguments" in content_:
                message['content'] = json.dumps(content_["arguments"], ensure_ascii=False)
                message['metadata'] = content_["name"]
        except:
            pass

The complete function is as follows:

def process_message(message):
    if 'tools' in message and message['role'] == 'system':
        for tool in message['tools']:
            parameters = tool['function']['parameters']['properties']
            tool['function']['parameters']['properties'] = \
                {k: v for k, v in parameters.items() if
                 v is not None}
    elif 'tools' in message:
        del message['tools']

    # Convert agent chat training data.
    if message['role'] == 'assistant':
        content = message['content']
        if isinstance(content, str) and content.startswith("{") and content.endswith("}"):
            try:
                content_ = eval(content)
                if isinstance(content_, dict) and "name" in content_ and "arguments" in content_:
                    message['content'] = json.dumps(content_["arguments"], ensure_ascii=False)
                    message['metadata'] = content_["name"]
            except:
                pass
    return message

Result: training for multi-turn agent chat with tool calls completed successfully.
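One caveat about the snippet above: eval executes arbitrary Python, and it raises on JSON literals such as true, false and null, in which case the bare except silently leaves the message unconverted. A possible alternative, sketched here under the assumption that tool calls are stored as JSON strings with "name" and "arguments" keys, parses with json.loads instead. The function name convert_tool_call is hypothetical and is not part of finetune.py.

```python
import json


def convert_tool_call(message):
    """Move a JSON-encoded tool call into 'metadata' and 'content'.

    Hypothetical helper; messages that are not tool calls are returned unchanged.
    """
    if message.get("role") != "assistant":
        return message
    content = message.get("content")
    if not isinstance(content, str):
        return message
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError:
        return message  # ordinary text reply, leave untouched
    if isinstance(parsed, dict) and "name" in parsed and "arguments" in parsed:
        message["metadata"] = parsed["name"]
        message["content"] = json.dumps(parsed["arguments"], ensure_ascii=False)
    return message


tool_call = convert_tool_call({
    "role": "assistant",
    "content": "{\"name\": \"get_recommended_books\", "
               "\"arguments\": {\"interests\": [\"history\", \"science fiction\"]}}",
})
plain_reply = convert_tool_call({"role": "assistant", "content": "Here are some books."})

print(tool_call["metadata"])  # get_recommended_books
print(tool_call["content"])   # {"interests": ["history", "science fiction"]}
print(plain_reply)            # {'role': 'assistant', 'content': 'Here are some books.'}
```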

@zRzRzRzRzRzRzR
Member

The [0] issue is there because you have not updated to the latest version of the file.

Member

Please revise according to the review comments.

@lilongxian
Author

Received, thank you!

@lilongxian
Author

Revised. Please take a look, thank you!


2 participants