posts/pytorch-train-keypoint-rcnn-tutorial/ #46

utterances-bot · 2024-02-01T12:32:00Z

Christian Mills - Training Keypoint R-CNN Models with PyTorch

Learn how to train Keypoint R-CNN models on custom datasets with PyTorch.

https://christianjmills.com/posts/pytorch-train-keypoint-rcnn-tutorial/

troymyname · 2024-02-01T12:32:01Z

Thank you for the amazing tutorials Chris!

zickezacke · 2024-02-23T03:08:17Z

Great tutorial Chris!
I tried to reproduce your approach, but it triggers the assert:

Loss is NaN or infinite at epoch 0, batch 0. Stopping training.

Thanks

cj-mills · 2024-02-23T19:28:58Z

Hi @zickezacke,

I verified the training code in the tutorial runs successfully on a CPU and a CUDA GPU this morning.

I forgot to include a version of the Jupyter Notebook for running on Windows, so I added that. Python multiprocessing works differently on Windows versus Linux, so the training code needs slight tweaks. Although, I don't believe that is the source of your issue, as that results in a different error.

I don't have a Mac, so I can't verify how the code runs there if that is what you are using.

Were you trying to implement the code manually? If so, try downloading and running the pre-completed training notebook to see if that runs successfully.
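
One way to narrow down a NaN loss like the one reported above is to check each loss component separately. torchvision's Keypoint R-CNN returns a dictionary of named losses in training mode. The sketch below assumes those values have already been converted to plain Python floats (for example with `.item()`), and the helper name `find_nonfinite_losses` is hypothetical, not part of the tutorial code.

```python
import math


def find_nonfinite_losses(loss_values):
    """Return the names of loss components that are NaN or infinite.

    `loss_values` maps a loss name to a plain Python float.
    """
    return [name for name, value in loss_values.items() if not math.isfinite(value)]


# Hypothetical values for illustration only; real values come from the model.
example_losses = {
    "loss_classifier": 0.71,
    "loss_box_reg": 0.05,
    "loss_keypoint": float("nan"),
    "loss_objectness": 0.02,
    "loss_rpn_box_reg": 0.01,
}

bad = find_nonfinite_losses(example_losses)
if bad:
    print(f"Non-finite loss components: {bad}")
```

If only one component is non-finite, that may point at the corresponding head or its targets; if all of them are, the inputs or the mixed-precision settings may be worth a look.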

zickezacke · 2024-02-24T03:58:53Z

Hi @cj-mills ,

Thank you for your response. I appreciate it.
I tried running your script with a copy of the notebook with the same result.
The notebook is running in WSL2 with CUDA. I do not know why loss_item is "nan", but I am going to look into it and see if I can figure it out.

Wish you a great weekend!

troymyname · 2024-03-31T11:30:45Z

Hi Chris, have you considered writing a tutorial to deploy the models? For instance, have you considered converting the model from ONNX to a format that can be used in a TensorRT environment? Thanks again for your efforts!

cj-mills · 2024-03-31T18:50:42Z

@troymyname Like this one?

Quantizing YOLOX with ONNX Runtime and TensorRT in Ubuntu

If so, I have been considering it for the other model tutorials. It's just a matter of finding the time to do those (and the other tutorials I've had planned for a while).

troymyname · 2024-04-02T09:22:32Z

@cj-mills That's correct. I am looking into solutions to quantize the model and prepare it for deployment. I have looked into several conversion pathways such as Torch --> TensorRT or Torch --> ONNX --> TensorRT. I have used the Polygraphy package from NVIDIA to prepare the model prior to conversion. However, I am running into issues at the moment. I will keep trying, and also look out for your post on how to do so for the Keypoint R-CNN model. Thanks!

zickezacke · 2024-04-02T13:20:24Z

I agree with you. Lots of moving pieces. While I was able to achieve good results with the R-CNN, I started looking at the Fast R-CNN V3 because it would reduce the required feature implementation and increase the overall performance.

ErikDerGute · 2024-04-11T11:31:42Z

Hey man, huge thank you for your amazing tutorial. I used your tutorial as a walkthrough to implement Keypoint R-CNN for my custom application. I want to train the net on a custom dataset. The dataset contains n classes, and each class contains m keypoints. However, sooner or later I always run into the same error: "keypoint_loss = F.cross_entropy(keypoint_logits[valid], keypoint_targets[valid])". Maybe you know where this error comes from. Thanks

cj-mills · 2024-04-11T18:47:39Z

Hi @ErikDerGute,

Would you mind adding the complete error statement?

The tutorial code does not currently support multiple object classes, so you would need to make some modifications.

First, the sample dataset used in the tutorial has the same number of object classes (one+background) as the dataset used to pre-train the model, so it only updates the keypoint predictor. You would also need to update the model's bounding box predictor to use it with a dataset containing multiple object classes.

Something like this:

from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 3  # Include background as a class, e.g., for 2 actual classes, this would be 3
in_features_box = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features_box, num_classes)

The current training loop assumes there is a single object class (excluding the background class), and hardcodes the values for gt_labels to 1 for the single object class. You would need to update the LabelMeKeypointDataset class to get the index values for the object classes and pass that to the training loop.
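
A minimal sketch of that label lookup, assuming each annotated object carries a class-name string. The class names and the helper `labels_to_indices` are hypothetical; the LabelMeKeypointDataset class would need an equivalent step, with index 0 reserved for the background class.

```python
# Hypothetical object classes; index 0 is reserved for the background class.
class_names = ["background", "cat", "dog"]
class_to_idx = {name: idx for idx, name in enumerate(class_names)}


def labels_to_indices(object_labels):
    """Map the class name of each annotated object to its integer index."""
    return [class_to_idx[label] for label in object_labels]


# Two objects in one image: a dog and a cat.
gt_labels = labels_to_indices(["dog", "cat"])
print(gt_labels)  # [2, 1]
```

With this layout, `len(class_names)` gives the num_classes value for the box predictor, background included.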

It sounds like the different object classes all have the same number of keypoints. If not, you would need to make some additional changes.

When modifying the keypoint predictor, you would use the keypoint count from the object class with the highest number of keypoints.

Next, you would need to update the training code to address the fact that the object classes in your dataset contain varying numbers of keypoints. The loss function expects the keypoint count specified for the keypoint predictor, which will be higher than some of the classes in your dataset.

You could probably address this using the visibility_mask, where the extra keypoints get marked as not visible for object classes with lower numbers of keypoints. Although, I have not tested this approach.
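
A rough sketch of that padding idea, equally untested against the tutorial's training loop. It assumes each keypoint is stored as [x, y, visibility], which is the layout torchvision's Keypoint R-CNN uses for targets; the helper name `pad_keypoints` is hypothetical.

```python
def pad_keypoints(keypoints, max_keypoints):
    """Pad a list of [x, y, visibility] keypoints to `max_keypoints` entries.

    Padding entries use visibility 0 so they can be masked out of the loss.
    """
    if len(keypoints) > max_keypoints:
        raise ValueError("object has more keypoints than the predictor supports")
    padding = [[0.0, 0.0, 0.0] for _ in range(max_keypoints - len(keypoints))]
    return [list(kp) for kp in keypoints] + padding


# Hypothetical object with 2 keypoints, padded for a 4-keypoint predictor.
padded = pad_keypoints([[10.0, 20.0, 1.0], [30.0, 40.0, 1.0]], max_keypoints=4)
visibility_mask = [kp[2] > 0 for kp in padded]
print(visibility_mask)  # [True, True, False, False]
```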

ErikDerGute · 2024-04-11T19:27:13Z

Thanks for your fast and detailed reply. I had already adapted num_classes, num_keypoints, and so on for my application, as you describe above. However, I was able to figure out the cause of the error myself. It was entirely my fault: the model expects the keypoints to be formatted in target{} as [[[kp1_obj1], [kp2_obj1], [...]], [[kp1_obj2], [kp2_obj2], [...]]]. My keypoints were formatted like [[kp1_obj1, kp2_obj1, ...], [kp1_obj2, kp2_obj2, ...]]. Unfortunately, I overlooked this little flaw for two days. Of course, this format error causes chaos when trying to match the keypoint indices in the keypointrcnn_loss function. The final result is the index error I described.
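
For readers hitting a similar mismatch, here is a small sketch of regrouping keypoints into the nested per-object layout. torchvision's Keypoint R-CNN expects target keypoints of shape [N, K, 3]. The flat input layout used here (x, y, and visibility values concatenated per object) is an assumption for illustration, and `group_keypoints` is a hypothetical helper.

```python
def group_keypoints(flat_values, values_per_keypoint=3):
    """Split one object's flat [x1, y1, v1, x2, y2, v2, ...] list into [x, y, v] triplets."""
    if len(flat_values) % values_per_keypoint != 0:
        raise ValueError("flat keypoint list length is not a multiple of values_per_keypoint")
    return [
        flat_values[i:i + values_per_keypoint]
        for i in range(0, len(flat_values), values_per_keypoint)
    ]


# Hypothetical flat annotations for two objects with two keypoints each.
flat_objects = [
    [10, 20, 1, 30, 40, 1],
    [50, 60, 1, 70, 80, 0],
]
nested = [group_keypoints(obj) for obj in flat_objects]
print(nested[0])  # [[10, 20, 1], [30, 40, 1]]
```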

Yogeshvasu · 2024-04-28T13:46:08Z

Thank you for the amazing tutorials, Chris! I need to convert this model to .ptl for deployment on Android. Is that possible? I tried, but I am getting an error.

import torch
import torchvision.models as models
from torch.utils.mobile_optimizer import optimize_for_mobile
import torchvision.transforms as transforms

# Load your PyTorch model (modify this based on your model architecture and loading method)
file_id = val_keys[0]

# Retrieve the image file path associated with the file ID
test_file = img_dict[file_id]

# Open the test file
input_img = Image.open(test_file).convert('RGB')

model = models.detection.keypointrcnn_resnet50_fpn(pretrained=False)

# Define the device where you want to run your model
device = torch.device('cpu')  # or 'cuda' if you have GPU

# Ensure the model is in evaluation mode and move it to the specified device
model.eval()
model.to(device)

# Define your input data and preprocessing pipeline
#input_img = test_img # Your input image (modify this based on your input data)
example = torch.rand(1, 3, 224, 224)
traced_script_module = torch.jit.trace(model, example)
traced_script_module_optimized = optimize_for_mobile(traced_script_module)
traced_script_module_optimized._save_for_lite_interpreter("model.ptl")

cj-mills · 2024-05-02T22:32:57Z

Hi @Yogeshvasu,

While I would recommend using torch.jit.script instead of torch.jit.trace to resolve the first error message you are likely getting, I don't believe the model is supported by _save_for_lite_interpreter, unfortunately.

Yogeshvasu · 2024-05-03T04:06:40Z

Hi Chris,

Thanks for your suggestion. I have seen the conversion of your model to ONNX. Can this ONNX model be converted to TFLite? I need to check it for deployment on Android.

If possible, could you please confirm? I am getting an error with the code below.

from onnx_tf.backend import prepare
import onnx

onnx_model_path = 'model.onnx'
tf_model_path = 'model_tf'

onnx_model = onnx.load(onnx_model_path)
tf_rep = prepare(onnx_model)
tf_rep.export_graph(tf_model_path)

cj-mills · 2024-05-03T22:34:36Z

@Yogeshvasu Unfortunately, I don't believe it is supported by that method either. I'll probably end up replacing the Keypoint R-CNN model used in this tutorial with something that has more general compatibility at some point.

For now, if you just need a human keypoint estimation model for TFLite, check out this page:

TensorFlow Lite - Pose estimation

Padma04 · 2025-02-04T20:10:43Z

Hi Chris,
I am getting a PicklingError when running train_loop(model=model):

---------------------------------------------------------------------------
PicklingError                             Traceback (most recent call last)
Cell In[41], line 1
----> 1 train_loop(model=model, 
      2            train_dataloader=train_dataloader,
      3            valid_dataloader=valid_dataloader,
      4            optimizer=optimizer, 
      5            lr_scheduler=lr_scheduler, 
      6            device=torch.device(device), 
      7            epochs=epochs, 
      8            checkpoint_path=checkpoint_path,
      9            use_scaler=True)

Cell In[36], line 34, in train_loop(model, train_dataloader, valid_dataloader, optimizer, lr_scheduler, device, epochs, checkpoint_path, use_scaler)
     31 # Loop over the epochs
     32 for epoch in tqdm(range(epochs), desc="Epochs"):
     33     # Run a training epoch and get the training loss
---> 34     train_loss = run_epoch(model, train_dataloader, optimizer, lr_scheduler, device, scaler, epoch, is_training=True)
     35     # Run an evaluation epoch and get the validation loss
     36     with torch.no_grad():

Cell In[30], line 27, in run_epoch(model, dataloader, optimizer, lr_scheduler, device, scaler, epoch_id, is_training)
     24 progress_bar = tqdm(total=len(dataloader), desc="Train" if is_training else "Eval")
     26 # Iterate over data batches
---> 27 for batch_id, (inputs, targets) in enumerate(dataloader):
     28     
     29     # Move inputs and targets to the specified device
     30     inputs = torch.stack(inputs).to(device)
     31     # Extract the ground truth bounding boxes and labels

File ~\miniforge3\envs\pytorch-env\Lib\site-packages\torch\utils\data\dataloader.py:479, in DataLoader.__iter__(self)
    477 if self.persistent_workers and self.num_workers > 0:
    478     if self._iterator is None:
--> 479         self._iterator = self._get_iterator()
    480     else:
    481         self._iterator._reset(self)

File ~\miniforge3\envs\pytorch-env\Lib\site-packages\torch\utils\data\dataloader.py:415, in DataLoader._get_iterator(self)
    413 else:
    414     self.check_worker_number_rationality()
--> 415     return _MultiProcessingDataLoaderIter(self)

File ~\miniforge3\envs\pytorch-env\Lib\site-packages\torch\utils\data\dataloader.py:1138, in _MultiProcessingDataLoaderIter.__init__(self, loader)
   1131 w.daemon = True
   1132 # NB: Process.start() actually take some time as it needs to
   1133 #     start a process and pass the arguments over via a pipe.
   1134 #     Therefore, we only add a worker to self._workers list after
   1135 #     it started, so that we do not call .join() if program dies
   1136 #     before it starts, and __del__ tries to join but will get:
   1137 #     AssertionError: can only join a started process.
-> 1138 w.start()
   1139 self._index_queues.append(index_queue)
   1140 self._workers.append(w)

File ~\miniforge3\envs\pytorch-env\Lib\multiprocessing\process.py:121, in BaseProcess.start(self)
    118 assert not _current_process._config.get('daemon'), \
    119        'daemonic processes are not allowed to have children'
    120 _cleanup()
--> 121 self._popen = self._Popen(self)
    122 self._sentinel = self._popen.sentinel
    123 # Avoid a refcycle if the target function holds an indirect
    124 # reference to the process object (see bpo-30775)

File ~\miniforge3\envs\pytorch-env\Lib\multiprocessing\context.py:224, in Process._Popen(process_obj)
    222 @staticmethod
    223 def _Popen(process_obj):
--> 224     return _default_context.get_context().Process._Popen(process_obj)

File ~\miniforge3\envs\pytorch-env\Lib\multiprocessing\context.py:336, in SpawnProcess._Popen(process_obj)
    333 @staticmethod
    334 def _Popen(process_obj):
    335     from .popen_spawn_win32 import Popen
--> 336     return Popen(process_obj)

File ~\miniforge3\envs\pytorch-env\Lib\multiprocessing\popen_spawn_win32.py:95, in Popen.__init__(self, process_obj)
     93 try:
     94     reduction.dump(prep_data, to_child)
---> 95     reduction.dump(process_obj, to_child)
     96 finally:
     97     set_spawning_popen(None)

File ~\miniforge3\envs\pytorch-env\Lib\multiprocessing\reduction.py:60, in dump(obj, file, protocol)
     58 def dump(obj, file, protocol=None):
     59     '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60     ForkingPickler(file, protocol).dump(obj)

PicklingError: Can't pickle <function <lambda> at 0x000002B044F42B60>: attribute lookup <lambda> on __main__ failed

cj-mills · 2025-02-09T19:13:43Z

Hi @Padma04,

It appears you are following the tutorial on Windows. If so, please use the Windows version of the training code as noted in the Getting Started with the Code section of the tutorial.
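
For context on the PicklingError above: the traceback shows the DataLoader workers being started with the spawn method on Windows, which pickles the objects handed to each worker, and a lambda cannot be pickled. A minimal sketch of the usual workaround, replacing the lambda with a named function, follows. That the lambda in question is the DataLoader's collate_fn is an assumption, and `collate_batch` and `is_picklable` are hypothetical names; the Windows notebook may handle this differently.

```python
import pickle


def is_picklable(obj):
    """Return True if `obj` can be serialized with pickle."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


def collate_batch(batch):
    """Named replacement for `lambda batch: tuple(zip(*batch))`."""
    return tuple(zip(*batch))


# A lambda cannot be pickled, so it cannot be sent to a spawned worker.
print(is_picklable(lambda batch: tuple(zip(*batch))))  # False

# The named function produces the same result as the lambda would.
print(collate_batch([("img0", "target0"), ("img1", "target1")]))
# (('img0', 'img1'), ('target0', 'target1'))
```

In a notebook on Windows, the named function typically also needs to live in an importable .py file so the worker processes can find it; setting num_workers=0 is a common alternative that avoids worker processes altogether.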
