text-motion alignment pre-trained model #12
Or, please allow me to ask some questions about this part. Reading your paper, I found that your model design is very similar to the work TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis. Can I say that your model is based on it, adding support for hand motions and replacing MPNet with sBERT? Did you train within TMR's code framework?
@Wenretium Thanks for your interest! Your comment is really in-depth and insightful. Most of your answers are right. However, we did not train our model within TMR's framework. We implemented the model ourselves before they released their code. We plan to release our code in about 2 weeks. You can try the demo first. Best, Ling-Hao CHEN
I get it. Thanks for your quick reply!
Hi @Wenretium! We have released the TMR training code. [Note]: As the research target of this project is to clarify how to use text-motion alignment, TMR is renamed TMA in the ICML-24 version of our project.
Thank you very much! You provided very detailed code documentation.
@Wenretium You're welcome. If you have any questions, feel free to discuss!
Hello! I have another question. Since you didn't provide a full demo for the text-motion alignment loss, I added it based on my own understanding.
My question is: did I load the pretrained model correctly? In HumanTOMATO, did you calculate the text-motion alignment loss as 'torch.mean(text_emb - motion_emb)'?
@Wenretium Thanks for the reminder. I detail the implementation here:

```python
# Assumed imports (not shown in the original snippet); InfoNCE, args, texts,
# m_tokens_len, gen_supervise_tensor_list, loss_cls, and the TMR encoders come
# from the surrounding training code.
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

infoloss = InfoNCE(0.1)  # InfoNCE loss with temperature 0.1
filter_model = SentenceTransformer('sentence-transformers/paraphrase-MiniLM-L6-v2')

# if TMA supervision
if args.supervision:
    # generated motions
    all_supervise_motion = torch.cat(gen_supervise_tensor_list, dim=0).cuda()
    # motion length = token length * 4 (the upsampling rate is 4)
    full_m_tokens_len = (m_tokens_len.detach() * 4).tolist()
    # get TMR_motion_embedding
    TMR_motion_embedding = t2m_TMR_motionencoder(all_supervise_motion, full_m_tokens_len).loc
    # get TMR_text_embedding
    TMR_text_embedding = t2m_TMR_textencoder(texts).loc
    # sBERT embeddings are only used for filtering, so no gradients are needed
    with torch.no_grad():
        text_embedding = filter_model.encode(texts)
        text_embedding = torch.tensor(text_embedding).cuda()
        normalized = F.normalize(text_embedding, p=2, dim=1)
        # cosine similarity between caption embeddings
        emb_dist = normalized.matmul(normalized.T)
    loss_infonce = infoloss((TMR_motion_embedding, TMR_text_embedding), emb_dist)
    all_loss = loss_cls + args.lambdainfo * loss_infonce
```

Welcome any questions.
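The InfoNCE class itself is defined in the repository and not shown in this thread. As a rough sketch under stated assumptions, a TMR-style filtered InfoNCE, where pairs whose sBERT caption similarity is too high are masked out as likely false negatives, could look like the following. The class name, the 0.8 threshold, and the symmetric formulation are illustrative, not the repository's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoNCEFiltered(nn.Module):
    """Symmetric InfoNCE with false-negative filtering (illustrative sketch)."""

    def __init__(self, temperature=0.1, threshold=0.8):
        super().__init__()
        self.temperature = temperature
        self.threshold = threshold  # caption similarity above this -> likely false negative

    def forward(self, embeddings, emb_dist):
        motion_emb, text_emb = embeddings
        motion_emb = F.normalize(motion_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = motion_emb @ text_emb.T / self.temperature  # (B, B) similarity matrix

        # mask off-diagonal pairs whose captions are nearly identical
        eye = torch.eye(logits.size(0), dtype=torch.bool, device=logits.device)
        mask = (emb_dist > self.threshold) & ~eye
        logits = logits.masked_fill(mask, float('-inf'))

        labels = torch.arange(logits.size(0), device=logits.device)
        loss_m2t = F.cross_entropy(logits, labels)
        loss_t2m = F.cross_entropy(logits.T, labels)
        return (loss_m2t + loss_t2m) / 2
```

With this signature, a call like infoloss((TMR_motion_embedding, TMR_text_embedding), emb_dist) matches the usage in the snippet above.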
Hi @LinghaoChan, I would like to ask a follow-up question. I am using the trained h3d checkpoint you provided to reproduce the recall results. The chart below is the result I have.
[Results table: M2T recall, protocol A and protocol D]
Since the *_embedding.npy files are not provided, I used the demo code to get all text_emb and motion_emb of the test set. Both of them have shape (~4000, 256). In the retrieval code, I changed it to the following:
What I did was simply remove the requirement of sBERT. Am I missing something here? Retraining the TMA model is a bit costly for me. Looking forward to your reply!
@no-Seaweed The difference between protocol A and B is whether sBERT filtering is used or not.
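For intuition only, a minimal sketch of an M2T recall computation with and without such filtering is given below. It assumes, following TMR's convention rather than anything confirmed for this repository, that a retrieved caption whose sBERT similarity to the ground-truth caption exceeds a threshold also counts as correct; recall_at_k and the 0.9 threshold are hypothetical names and values:

```python
import numpy as np

def recall_at_k(motion_emb, text_emb, k=1, text_sim=None, threshold=0.9):
    """Motion-to-text recall@k (illustrative sketch, not the repo's evaluation code).

    motion_emb, text_emb: (N, 256) L2-normalized embeddings.
    text_sim: optional (N, N) sBERT caption-similarity matrix; when given,
        a retrieved caption that is nearly identical to the ground-truth
        caption is also counted as a correct match (filtered protocol).
    """
    sims = motion_emb @ text_emb.T              # (N, N) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]     # top-k caption indices per motion
    hits = 0
    for i, retrieved in enumerate(topk):
        if i in retrieved:
            hits += 1
        elif text_sim is not None and (text_sim[i, retrieved] > threshold).any():
            hits += 1                           # near-duplicate caption counts as a hit
    return hits / len(motion_emb)
```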
Thank you for your reply. I was able to get those embedding .npy files by executing
with minor modifications in the code. I got the following result:
[Results table: M2T recall, protocol B and protocol D]
Looks good, though the numbers are a bit off compared to the chart in the supplementary.
@no-Seaweed Seems good. The jitters are normal.
Hi, I have a question: how does loss_infonce perform gradient backpropagation?
Why is loss_infonce related to the VQ? @xjli360
Because the GPT predicts code tokens, the VQ decoder is needed to obtain the motion; the motion then passes through t2m_TMR_motionencoder to get the motion latent, which, together with the text latent, gives the InfoNCE loss.
I see what you mean. When introducing the TMA supervision, it cannot accept the codes directly; it should process the motion decoded from the codes. As you know, directly using the maximum-probability code is not differentiable. Therefore, we activate the GPT logits via Gumbel-Softmax instead of argmax. @xjli360
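For concreteness, a minimal sketch of that step is below, assuming hypothetical names and shapes (gpt_logits, codebook). It only illustrates how a straight-through Gumbel-Softmax keeps the code selection differentiable; it is not the repository's actual code:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes; gpt_logits stands in for the GPT's per-step code logits
# and codebook for the VQ codebook embeddings (both placeholder names).
B, T, codebook_size, code_dim = 4, 32, 512, 256
gpt_logits = torch.randn(B, T, codebook_size, requires_grad=True)
codebook = torch.randn(codebook_size, code_dim)

# Differentiable code selection: Gumbel-Softmax with a straight-through estimator
# (hard=True gives one-hot codes in the forward pass, soft gradients in backward),
# instead of a non-differentiable argmax over the logits.
code_probs = F.gumbel_softmax(gpt_logits, tau=1.0, hard=True, dim=-1)  # (B, T, codebook_size)
code_emb = code_probs @ codebook                                       # (B, T, code_dim)

# code_emb would then go through the VQ decoder to reconstruct the motion and
# into the TMA motion encoder, so gradients from loss_infonce can flow back
# into the GPT logits.

# sanity check: gradients reach the GPT logits
code_emb.sum().backward()
print(gpt_logits.grad.shape)  # torch.Size([4, 32, 512])
```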
Hi! I am very interested in your work, especially the text-motion alignment pre-trained model. I hope to see your model and code soon.