-
Thanks for reporting this, I think the … Have you tried … or …?
-
@wyli Thanks for the quick reply! With spawn (…) …
With … working results with …
-
@wyli This topic is sadly not resolved, however. Using ThreadDataLoader modifies the logic of my real code; I believe the affine matrix for the spatial transforms is getting lost in the process. The transforms, similar to DeepEdit, look like this:

```python
# Initial transforms on the CPU, which does not hurt since they are executed asynchronously and only once
InitLoggerd(args),  # necessary if the dataloader runs in an extra thread / process
LoadImaged(keys=("image", "label"), reader="ITKReader"),
EnsureChannelFirstd(keys=("image", "label")),
NormalizeLabelsInDatasetd(keys="label", label_names=labels, device=device),
Orientationd(keys=["image", "label"], axcodes="RAS"),
Spacingd(keys=["image", "label"], pixdim=spacing),
CropForegroundd(keys=("image", "label"), source_key="image", select_fn=threshold_foreground),
ScaleIntensityRanged(keys="image", a_min=0, a_max=43, b_min=0.0, b_max=1.0, clip=True),  # 0.05 and 99.95 percentiles of the spleen HUs
### Random Transforms ###
RandCropByPosNegLabeld(keys=("image", "label"), label_key="label", spatial_size=args.train_crop_size, pos=0.6, neg=0.4) if args.train_crop_size is not None else NoOpd(),
DivisiblePadd(keys=["image", "label"], k=64, value=0) if args.inferer == "SimpleInferer" else NoOpd(),  # UNet needs this
RandFlipd(keys=("image", "label"), spatial_axis=[0], prob=0.10),
RandFlipd(keys=("image", "label"), spatial_axis=[1], prob=0.10),
RandFlipd(keys=("image", "label"), spatial_axis=[2], prob=0.10),
RandRotate90d(keys=("image", "label"), prob=0.10, max_k=3),
# Move to GPU
ToTensord(keys=("image", "label"), device=device, track_meta=False),
```

With the DataLoader the resulting shape is torch.Size([1, 3, 224, 224, 320]) …
I have run into this error before and I believe this is due to …
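As a quick way to pin this down, a check like the following shows whether the affine survives into a batch (a minimal sketch, not code from this thread; check_meta is a hypothetical helper and the "image" key matches the pipeline above):

```python
# Minimal sketch, not code from this thread: check_meta is a hypothetical helper,
# and the "image" key matches the dictionary keys used in the pipeline above.
from monai.data import MetaTensor

def check_meta(batch):
    img = batch["image"]
    if isinstance(img, MetaTensor):
        print("image is a MetaTensor, affine:\n", img.affine)
    else:
        print("image is a plain", type(img).__name__, "- meta/affine information is gone")
```

With the plain DataLoader this should report a MetaTensor carrying the affine; a plain Tensor here means the spatial transforms downstream never see the original affine.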
-
yes, please use …
-
Does not change anything. Do I have to call that sooner? The ToTensord call is after all the relevant transforms imo.
-
Same problem, even with ToTensord right at the start:

```python
InitLoggerd(args),  # necessary if the dataloader runs in an extra thread / process
LoadImaged(keys=("image", "label"), reader="ITKReader"),
ToTensord(keys=("image", "label"), device=device, track_meta=True),
EnsureChannelFirstd(keys=("image", "label")),
NormalizeLabelsInDatasetd(keys="label", label_names=labels, device=device),
Orientationd(keys=["image", "label"], axcodes="RAS"),
Spacingd(keys=["image", "label"], pixdim=spacing),
CropForegroundd(keys=("image", "label"), source_key="image", select_fn=threshold_foreground),
ScaleIntensityRanged(keys="image", a_min=0, a_max=43, b_min=0.0, b_max=1.0, clip=True),  # 0.05 and 99.95 percentiles of the spleen HUs
```
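One thing worth double-checking with a pipeline like this (a hedged suggestion, not something established in this thread) is MONAI's global meta-tracking switch: if it is disabled anywhere, the loaders return plain tensors regardless of the per-transform track_meta arguments.

```python
# Hedged sketch: confirm that MONAI's global meta tracking is enabled
# (assumption: MONAI >= 0.9, where MetaTensor-based tracking is the default).
from monai.data import set_track_meta

set_track_meta(True)  # if this is False anywhere, plain torch.Tensors are returned
```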
-
I just changed my code to Dataset instead of PersistentDataset, just to be sure this is not a caching effect. Same results with Dataset.
-
ok... perhaps this …
-
Same result, so the transforms still don't work as expected, plus the code then crashes later, of course, since meta information is missing. The code is extremely similar to this one: https://github.com/Project-MONAI/MONAI/blob/dev/monai/apps/deepedit/transforms.py#L86, so it should check whether the tensor is a plain Tensor or a MetaTensor.
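For reference, the branching that such transforms rely on looks roughly like this (a paraphrased sketch, not a verbatim copy of the linked file; get_affine is a hypothetical helper):

```python
# Paraphrased sketch of the MetaTensor / meta_dict branching used by DeepEdit-style
# transforms; get_affine is a hypothetical helper, and the "<key>_meta_dict" naming
# follows MONAI's dictionary convention.
from monai.data import MetaTensor

def get_affine(data, key="image"):
    img = data[key]
    if isinstance(img, MetaTensor):
        return img.affine                      # new style: affine lives on the MetaTensor
    meta = data.get(f"{key}_meta_dict", {})
    return meta.get("affine")                  # old style: affine lives in the meta dict
```

When the loader hands back plain Tensors without a meta dict, both branches come up empty, which matches the missing-affine behaviour described above.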
-
@wyli Would be great if you find the time to check that out 😊 As I said, with DataLoader it works, with ThreadDataLoader it doesn't. Just for completeness I'll paste the calling code for both below.

```diff
- train_loader = DataLoader(
-     train_ds, shuffle=True, num_workers=args.num_workers, batch_size=1, multiprocessing_context='spawn', persistent_workers=True,
+ train_loader = ThreadDataLoader(
+     train_ds, shuffle=True, num_workers=args.num_workers, batch_size=1, multiprocessing_context='spawn', use_thread_workers=True#, persistent_workers=True,
```
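As a side note, MONAI's fast-training guidance generally pairs ThreadDataLoader with a cached dataset and num_workers=0, so a single background thread feeds batches instead of spawned worker processes. A minimal sketch (train_files and train_transforms are placeholders, not names from this thread):

```python
# Sketch of the usual ThreadDataLoader setup; train_files and train_transforms are
# placeholders, not names from this thread.
from monai.data import CacheDataset, ThreadDataLoader

train_ds = CacheDataset(data=train_files, transform=train_transforms, cache_rate=1.0)
train_loader = ThreadDataLoader(train_ds, batch_size=1, shuffle=True, num_workers=0)
```

Whether that configuration sidesteps the MetaTensor issue here is a separate question; it only shows the usage the class is designed around.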
-
sure, I forgot to mention that the issue of …
-
Ah no worries there, I am using args.num_workers==1 by default. Good to know anyway, then I won't increase it. I found old code mentioning setting num_workers to 0, but that no longer works.
-
Hi @diazandr3s, I'm not sure about the root cause of this deepedit transform + ThreadDataLoader issue, please have a look if you have time, thanks! (Converting this to a discussion for now, please feel free to create a bug report if it's not a usage question.)
-
Since changing one single line, DataLoader to ThreadDataLoader, changes how the transforms work and renders ThreadDataLoader unusable for me, I would consider this a bug. I will paste below how the different batchdata looks after the LoadImaged() call (done with PrintDatad(), a transform I added). It should be easy to debug for the person who created the code. The problem is that with use_thread_workers=True, Tensors are returned instead of MetaTensors, which means the meta dicts get lost. I guess this is the dict used by the other transforms, and they just ignore the other place where this information is stored: image_meta_dict/original_affine.

With use_thread_workers=True: …

With use_thread_workers not set: …
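For anyone reproducing this, a debug transform along these lines is enough to see the type change between the two loaders (a hypothetical reconstruction; the PrintDatad used in this thread is not shown):

```python
# Hypothetical reconstruction of a PrintDatad-like debug transform (the original used
# in this thread is not shown); prints type, shape and meta availability per key.
from monai.transforms import MapTransform

class PrintDatad(MapTransform):
    def __call__(self, data):
        d = dict(data)
        for key in self.key_iterator(d):
            item = d[key]
            has_meta = getattr(item, "meta", None) is not None  # MetaTensor exposes .meta
            print(f"{key}: {type(item).__name__}, shape={tuple(item.shape)}, meta={has_meta}")
        return d
```

Dropping PrintDatad(keys=("image", "label")) right after LoadImaged should, per the behaviour described above, print MetaTensor with the plain DataLoader and Tensor once use_thread_workers=True is set.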
-
@wyli I do have some more context for you now. The …
I now get: …
Not sure if it is intended that changing a single dataloader flag leads to a completely different returned data type. So the affine information is not getting read correctly because the loaded image and label are not of type MetaTensor. Even by explicitly adding a … Maybe that helps debugging the issue.
-
Describe the bug
I just tried to get some sample code for #6626 but ran into a warning I have seen many times before. The problem appears when the transforms push the data to the GPU and the data is then handed over from the DataLoader thread to the main thread.
This is not a hard bug, but it is very annoying since the warning gets spammed a lot.
A temporary workaround I found is to add persistent_workers=True to the DataLoader; then the warning is only shown at the end of the program, sometimes never.
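A minimal sketch of that workaround (train_ds stands in for the dataset from the reproduction code, which is not shown here):

```python
# Sketch of the persistent_workers workaround; train_ds stands in for the dataset from
# the reproduction code (not shown here), whose transforms move data to the GPU.
from monai.data import DataLoader

train_loader = DataLoader(
    train_ds,
    batch_size=1,
    num_workers=1,
    multiprocessing_context="spawn",  # needed when CUDA is initialised inside the workers
    persistent_workers=True,          # keeps workers alive, so the warning is not spammed
)
```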
Warning message:
To Reproduce
Run this code, minimal sample:
Expected behavior
No CUDA warnings.
Environment
Verified on different environments.
Additional context
Adding an evaluator further complicates the warnings and a new warning is now shown:
The code for that: