This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/CherubimHD on 2024-09-04 21:24:07+00:00.


I’m collecting trajectories for imitation learning (RL). Each trajectory is about 1500 time steps long and consists of 4 image streams of roughly 600x600 pixels. Obviously, the dataset size grows extremely quickly with the number of trajectories.

What are some good libraries for storing such data efficiently in terms of disk space? I tried h5py with level-9 gzip compression, but the files are still way too large. Is there a better alternative?
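For reference, the h5py setup I tried looks roughly like this (a minimal sketch, assuming one file per trajectory and the dimensions above; the dataset name and chunk shape are just placeholders):

```python
import h5py
import numpy as np

# Hypothetical shapes: 1500 steps, 4 camera streams, 600x600 RGB frames
frames = np.zeros((1500, 4, 600, 600, 3), dtype=np.uint8)

with h5py.File("trajectory_0000.h5", "w") as f:
    # gzip level 9 is the strongest built-in option, but it is lossless
    # and per-chunk, so temporal redundancy between frames is not exploited
    f.create_dataset(
        "rgb",
        data=frames,
        compression="gzip",
        compression_opts=9,
        chunks=(1, 1, 600, 600, 3),  # one frame per chunk
    )
```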

Saving and loading times do not really matter.

Most resources online are aimed at efficiently loading large datasets or handling them in memory, which is not relevant to my question.

I already use uint8 as the datatype for the RGB streams.

UPDATE: I ended up using lossy video compression via scikit-video. This results in a file size of just 2 MB instead of almost 2 GB when storing raw frames in an array. A histogram of the reconstruction error shows that most pixel differences are in the low single-digit range, which is not a problem in my case since I would apply domain randomisation through noise anyway.
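A minimal sketch of how this can be done with scikit-video (assuming ffmpeg is installed; the codec and CRF value here are illustrative, not the exact settings used):

```python
import numpy as np
import skvideo.io

# Hypothetical stream: 1500 frames of 600x600 RGB, dtype uint8
frames = np.zeros((1500, 600, 600, 3), dtype=np.uint8)

# Encode one stream as H.264; CRF controls the quality/size trade-off
# (lower = better quality, larger file)
skvideo.io.vwrite(
    "stream_0.mp4",
    frames,
    outputdict={"-vcodec": "libx264", "-crf": "23", "-pix_fmt": "yuv420p"},
)

# Load it back as a (T, H, W, 3) uint8 array for training
decoded = skvideo.io.vread("stream_0.mp4")

# Inspect the per-pixel reconstruction error
diff = np.abs(frames.astype(np.int16) - decoded.astype(np.int16))
print(diff.max(), np.histogram(diff, bins=10)[0])
```

The `outputdict` entries are passed straight to ffmpeg, so any libx264 option (preset, CRF, pixel format) can be tuned the same way.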