Dataloading Revamp #3216

Draft
AntonioMacaronio wants to merge 45 commits into main

Conversation

@AntonioMacaronio (Contributor) commented Jun 12, 2024:

Problems and Background

  • With a sufficiently large dataset, the current parallel_datamanager.py tries to cache the entire dataset into RAM, which leads to an OOM error.
  • Ray-bundle generation involves CPU-intensive steps such as unprojection during ray generation and pixel sampling within a custom mask, so it benefits from parallelization. While parallel_datamanager.py does support multiple workers, each worker caches the entire dataset to RAM, so massive datasets are not supported and the dataset is duplicated in memory once per worker.
  • Additionally, both VanillaDataManager and ParallelDataManager rely on CacheDataloader, which subclasses torch.utils.data.DataLoader, an awkward pattern.
  • full_images_datamanager.py has the same problem: the current implementation loads the entire dataset into the FullImageDataloader's cached_train attribute, which fails when the dataset cannot fit in RAM. To handle this efficiently, we need multiprocess parallelization to load batches of images (batched image dataloading also matters now that gsplat supports batched rasterization).

Overview of Changes

  • Replacing CacheDataloader with RayBatchStream, which subclasses torch.utils.data.IterableDataset. This class generates ray bundles without caching all images to RAM: it collates only a small sampled subset of images at a time and samples pixels from that pool (a minimal sketch of the idea follows this list).
  • Adding an ImageBatchStream that extends FullImageDataloader's functionality while simplifying it.
  • A new pil_to_numpy() function. It reads a PIL.Image's data buffer and fills a preallocated numpy array while decoding, which speeds up the conversion and removes an extra memory allocation. It is the fastest path we measured from a PIL Image to a PyTorch tensor, averaging ~2.5 ms for a 1080x1920 image (~40% faster).
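
A minimal sketch of the RayBatchStream idea (not the PR's actual code), assuming nerfstudio's existing PixelSampler.sample(), RayGenerator, and nerfstudio_collate interfaces; the pool sizes and refresh policy here are illustrative placeholders:

import random

from torch.utils.data import IterableDataset


class RayBatchStreamSketch(IterableDataset):
    """Yields (ray_bundle, batch) pairs while holding only a small image pool in RAM."""

    def __init__(self, dataset, pixel_sampler, ray_generator, collate_fn,
                 images_per_pool: int = 40, batches_per_pool: int = 10):
        self.dataset = dataset              # per-image dicts via __getitem__
        self.pixel_sampler = pixel_sampler  # samples pixel/image indices from a collated pool
        self.ray_generator = ray_generator  # maps sampled indices to a RayBundle
        self.collate_fn = collate_fn        # e.g. nerfstudio_collate
        self.images_per_pool = images_per_pool
        self.batches_per_pool = batches_per_pool

    def _new_pool(self):
        # Collate only a small random subset of images instead of the whole dataset.
        indices = random.sample(range(len(self.dataset)), k=self.images_per_pool)
        return self.collate_fn([self.dataset[i] for i in indices])

    def __iter__(self):
        while True:
            pool = self._new_pool()
            # Reuse each pool for several ray batches before refreshing it,
            # amortizing the image loading/collation cost.
            for _ in range(self.batches_per_pool):
                batch = self.pixel_sampler.sample(pool)
                yield self.ray_generator(batch["indices"]), batch

Wrapping such a stream in a torch DataLoader with several workers gives parallel ray-bundle generation without any worker ever holding the full dataset.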

Impact

  • Check out these comparisons! The left was trained on 200 frames of a 4K video, while the right was trained on 2000 frames of the same 4K video.

@pwais (Contributor) left a comment:

nice progress! sorry it's not fast, but I think I know why:

I think the main reason this is slower than expected is that _get_collated_batch() gets called per raybundle, and sadly _get_collated_batch() is AFAIK needlessly slow.

  • Note how the current CacheDataloader avoids calling _get_collated_batch() per raybundle. It would have been nice for the author to have left some notes about how slow _get_collated_batch() is, but evidently that author found it necessary not to collate images per raybundle.
  • In my impl, I just call _get_collated_batch() once on a small set of images and keep that batch cached. The main problem I saw is that _get_collated_batch() on thousands of images seemed to use 2x or 3x as much RAM as actually needed, causing many minutes of swapping.

Even if you only call _get_collated_batch() once though, you might need a bigger prefetch factor and/or more workers depending on the model.

IMO it's worth trying to find a way to keep the result of nerfstudio_collate on cameras (I think the cameras do need to be collated because they can be ragged? I could be wrong and they may not need collation), but for images just have the worker read image files / buffers and never call collate on those tensors.

Just to be clear, this is the line where collate on images can go nuts and start taking forever to allocate 200GB or more of RAM for many images in the code on main:

storage = elem.storage()._new_shared(numel, device=elem.device)

So! If a worker is just emitting raybundles, then the images never need to be in shared tensor memory, eh? We should be able to save some RAM and CPU by skipping that line for images. Still need to think about the cost of reading the images themselves, but collate is definitely a troublemaker.
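
Not from the PR, just a hedged illustration of that point: inside a worker, images can be read into ordinary process-local tensors, and collation (with its shared-memory storage) can be reserved for the small, possibly ragged camera metadata:

import numpy as np
import torch
from PIL import Image


def load_image_local(path: str) -> torch.Tensor:
    # Ordinary (non-shared) float tensor; fine because only the sampled
    # raybundle ever leaves the worker process, not the full image.
    with Image.open(path) as im:
        arr = np.asarray(im, dtype=np.float32) / 255.0
    return torch.from_numpy(arr)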

"""The limit number of batches a worker will start loading once an iterator is created.
Each next() call on the iterator has the CPU prepare more batches up to this
limit while the GPU is performing forward and backward passes on the model."""
dataloader_num_workers: int = 2
Contributor:

FWIW, for a 3090 I was using 16 workers, a prefetch factor of 16, and a train ray batch size of 24000, and I was getting the same or better "rays per sec" in the console output as with the in-repo impl (ParallelDataManager). Steady-state I was using fewer than 16 CPUs, I believe. If the batch size and prefetch are small, then you definitely need more workers.
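
For reference, a hedged sketch of how those knobs map onto a standard torch DataLoader wrapped around the iterable stream; the numbers mirror the comment above and are not a recommendation:

from torch.utils.data import DataLoader, IterableDataset


def make_train_loader(stream: IterableDataset) -> DataLoader:
    # Settings from the 3090 anecdote above; the right values depend on the
    # ray batch size, model speed, and available CPU cores.
    return DataLoader(
        stream,                   # e.g. a RayBatchStream instance
        batch_size=None,          # the stream already yields fully formed ray batches
        num_workers=16,
        prefetch_factor=16,       # only valid when num_workers > 0
        persistent_workers=True,  # avoid re-spawning workers repeatedly
    )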

self.device = device
self.collate_fn = collate_fn
# self.num_workers = kwargs.get("num_workers", 32) # nb only 4 in defaults
self.num_image_load_threads = num_image_load_threads # kwargs.get("num_workers", 4) # nb only 4 in defaults
Contributor:

this is really a hack to hide disk I/O ... I only needed 2 here. it really depends on how much RAM / disk cache the user has.

# print(indices)
# print(type(batch_list[0])) # prints <class 'dict'>
# print(self.collate_fn) # prints nerfstudio_collate
collated_batch = self.collate_fn(batch_list)
Contributor:

I wish we knew if there's some way to get rid of collate on images, because that appears to be the biggest waste.

Resolved review thread on nerfstudio/data/datamanagers/base_datamanager.py (outdated).
if self.config.use_ray_train_dataloader:
import torch.multiprocessing as mp

mp.set_start_method("spawn")
Contributor:

I think this should get removed, and something farther up the stack should call it if needed. I think we shouldn't need it if the workers don't use CUDA?
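
A hedged sketch of one way to keep the call safe if it does stay here (whether spawn is needed at all depends on whether the workers touch CUDA, as noted above):

import torch.multiprocessing as mp

# Guarding the call avoids the RuntimeError raised when the start method
# has already been set elsewhere in the process.
if mp.get_start_method(allow_none=True) is None:
    mp.set_start_method("spawn")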

@pwais (Contributor) left a comment:

just took a quick look (can't do a full review right now), so cool to see this coming along!!

Sounds like this change targets the case where uncompressed image tensors can't fit in RAM but the raw image files (typically jpeg) do fit in RAM. In that case I guess we do want each worker to literally load the file bytes into Python RAM (as implemented) versus letting the OS disk cache do the work, because the idea is that the uncompressed image tensors would otherwise blow out the disk cache.

I think it would be important to eventually test a case where the user only has limited RAM (say 16GB) and e.g. an 8GB laptop graphics card; I think there are moderate or larger image datasets where the whole thing would OOM with the current cache impl. In that case, it would be helpful to have some way to disable the cache, or to just tell the user that their machine is simply too weak for the dataset (e.g. a CONSOLE.print("[bold yellow]Warning ...") at the line where the workers start reading image files into RAM).
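
A hedged sketch of that warning; psutil, the helper name, and the 80% threshold are illustrative assumptions, while CONSOLE is nerfstudio's existing rich console:

import psutil
from nerfstudio.utils.rich_utils import CONSOLE


def warn_if_cache_may_exhaust_ram(estimated_cache_bytes: int) -> None:
    # Warn up front instead of letting the workers OOM mid-training.
    available = psutil.virtual_memory().available
    if estimated_cache_bytes > 0.8 * available:
        CONSOLE.print(
            "[bold yellow]Warning: caching image files may exceed available RAM; "
            "consider disabling the cache or reducing dataset size/resolution."
        )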

Resolved review threads:
  • nerfstudio/data/datamanagers/base_datamanager.py (outdated, 2 threads)
  • nerfstudio/data/utils/data_utils.py
  • nerfstudio/data/utils/dataloaders.py (outdated)
@pwais (Contributor) left a comment:

Wow the visual results look amazing! Thank you so much for continuing to hack on this!

class FullImageBatchStreamConfig(DataManagerConfig):
_target: Type = field(default_factory=lambda: ImageBatchStream)

## Let's implement a parallelized splat dataloader!
Contributor:

🎉

"indices": torch.cat([batch_i["indices"] for batch_i in batch_list], dim=0),
}
# end = time.time()
# print((end - start) * 1000)
Contributor:

curious, this function is pretty fast right?

@@ -178,6 +206,7 @@ def __init__(self):
self.train_count = 0
self.eval_count = 0
if self.train_dataset and self.test_mode != "inference":
# print(self.setup_train) # prints <bound method ParallelFullImageDatamanager.setup_train of ParallelFullImageDatamanager()>
Contributor:

Suggested change
# print(self.setup_train) # prints <bound method ParallelFullImageDatamanager.setup_train of ParallelFullImageDatamanager()>

self.exclude_batch_keys_from_device = exclude_batch_keys_from_device
# print("self.exclude_batch_keys_from_device", self.exclude_batch_keys_from_device) # usually prints ['image']
self.datamanager_config = datamanager_config
self.pixel_sampler: PixelSampler = None
Contributor:

Suggested change
self.pixel_sampler: PixelSampler = None
self.pixel_sampler: Optional[PixelSampler] = None # lazy init

# print("self.exclude_batch_keys_from_device", self.exclude_batch_keys_from_device) # usually prints ['image']
self.datamanager_config = datamanager_config
self.pixel_sampler: PixelSampler = None
self.ray_generator: RayGenerator = None
Contributor:

Suggested change
self.ray_generator: RayGenerator = None
self.ray_generator: Optional[RayGenerator] = None # lazy init

camera.metadata["cam_idx"] = idx
i += 1
if torch.sum(camera.camera_to_worlds) == 0:
print(i, camera.camera_to_worlds, "YOYO INSIDE IMAGEBATCHSTREAM")
Contributor:

yoyo im inside ur GPU eating ur RAMs!

camera.metadata = {}
camera.metadata["cam_idx"] = idx
i += 1
if torch.sum(camera.camera_to_worlds) == 0:
Contributor:

ratrow, doesn't this mean the splat optimizer will get empty poses as well?

@@ -724,7 +724,7 @@ def get_outputs(self, camera: Cameras) -> Dict[str, Union[torch.Tensor, List]]:
render_mode = "RGB+ED"
else:
render_mode = "RGB"

# breakpoint()
Contributor:

Suggested change
# breakpoint()

@@ -118,25 +123,101 @@ def read_trajectory_csv_to_dict(file_iterable_csv: str) -> TimedPoses:
)


def undistort_image_and_calibration(
Contributor:

Maybe the changes in this file / module should be broken out into a separate PR at some point; they could probably ship sooner that way too (?)

self.input_dataset = input_dataset
self.device = device

def __iter__(self):
Contributor:

Interesting, so this works fast enough to undistort on the fly without caching the undistorted image? Curious what CPU / GPU combo and number of workers was good here.

Or maybe it does slow training a bit, but it works without OOM on bigger datasets :)

@AntonioMacaronio (Contributor, Author) commented Sep 1, 2024:

Yeah, it does slow training quite a bit. With 1 worker, training time went from 9 minutes to 18 minutes (with downsampling to a resolution within 1600x1600).

It also depends on what type of undistortion is occurring. I have only tested this with CAMERA.PERSPECTIVE datasets that had small amounts of radial and tangential distortion; fisheye and equirectangular camera models will definitely take longer to undistort. I will do some further benchmarks to find how many workers are needed.

Contributor:

Ah interesting! 2x is not that bad, given that this change makes training feasible when RAM is low.

Maybe caching the undistorted image could help then? In your other code you cache the jpeg bytes, which helps, but here you can't necessarily cache the compressed image, so you would have to cache the undistorted image.

All that said, it's very nice to even have training enabled, even if you don't implement / test caching.
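
A hedged sketch of the caching idea for the perspective case only (fisheye would need cv2.fisheye instead); the memoization key, cache size, and byte-packed intrinsics are assumptions, not the PR's undistort_image_and_calibration:

import functools

import cv2
import numpy as np


@functools.lru_cache(maxsize=128)  # per-worker cache; size is a guess
def cached_undistorted(image_path: str, K_bytes: bytes, dist_bytes: bytes) -> np.ndarray:
    # Undistort each image at most once per worker and reuse the result
    # on later sampling passes over the same image.
    K = np.frombuffer(K_bytes, dtype=np.float64).reshape(3, 3)
    dist = np.frombuffer(dist_bytes, dtype=np.float64)
    image = cv2.imread(image_path)
    return cv2.undistort(image, K, dist)

Callers would pass the intrinsics as bytes (e.g. K.astype(np.float64).tobytes()) so the lru_cache arguments stay hashable.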

@AntonioMacaronio changed the title from "Dataloading revamp" to "Dataloading Revamp" on Sep 1, 2024.