Feature Request: Use pigz #40

Closed
xrobau opened this issue Apr 29, 2020 · 5 comments
xrobau commented Apr 29, 2020

As CPUs are massively faster than anything else we have, we should be compressing data before moving it around. pigz is a multi-threaded implementation of gzip, which scales almost linearly with the number of cores.

I've been running an older hacked version of zfs_backup which uses bash -c on the remote machine, and I was going to do a proper PR for this, but there seems to be some strange issue that I can't figure out.

This is my send command, which works fine:

ssh storage1 '/bin/bash' '-c' '( zfs send' '-L' '-e' '-c' '-D' '-v' '-P' '-p' 'pool1/mainstore@backup-20200428071553' '|pigz)'

I then use shell=True on the local Popen to allow this, since you want to keep Python out of the data path: it would just be copying data in and out of memory and slowing things down. (The second line below is the output of self.debug(encoded_cmd) just before the Popen, on line 413.)

# [Target] Piping input
# [Target] [b'/usr/bin/pigz', b'-d', b'|', b'zfs', b'recv', b'-u', b'-v', b'bigmirror/storage1/pool1/mainstore']
! DATASET FAILED: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
! Exception: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

This looks like the new version is trying to readline().decode('utf-8') the binary data coming from zfs send (or, in this case, zfs send | pigz), and I can't understand why that would be happening.
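
(0x8b is the second byte of the gzip magic number 1f 8b, so this is exactly what you get when compressed output hits a UTF-8 decoder. A minimal standalone reproduction, nothing to do with the project's own code:

import gzip

data = gzip.compress(b"raw zfs stream")
print(data[:2].hex())  # 1f8b -- the gzip magic number
data.decode("utf-8")   # UnicodeDecodeError: 'utf-8' codec can't decode
                       # byte 0x8b in position 1: invalid start byte
)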

Any hints?

@gdevenyi

Rather than hard-coding a specific compression tool, I think we should either make it configurable or provide some pre-configured options, as syncoid does:
https://github.com/jimsalterjrs/sanoid/blob/master/syncoid#L847-L905

P.S. zstd is way better than pigz if available
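
For illustration, such a table could be as small as a dict mapping a name to compress/decompress argv lists; the names and flag choices below are plausible defaults, not syncoid's or this project's actual configuration:

# Sketch of a configurable compression table: each entry gives the
# command to run on the sending and on the receiving side of the pipe.
COMPRESSORS = {
    "gzip": {"compress": ["gzip", "-3"],       "decompress": ["gzip", "-d"]},
    "pigz": {"compress": ["pigz", "-3"],       "decompress": ["pigz", "-d"]},
    "zstd": {"compress": ["zstd", "-c", "-3"], "decompress": ["zstd", "-dc"]},
    "xz":   {"compress": ["xz"],               "decompress": ["xz", "-d"]},
}

def compression_args(name, side):
    # side is "compress" on the sender, "decompress" on the receiver
    return COMPRESSORS[name][side]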


xrobau commented Apr 30, 2020

Agreed, but pigz is a simple lowest common denominator that was going to be the start of the PR. Until I can understand why it's trying to readline() the gzipped data, though, I'm confused 8-\


psy0rz commented Apr 30, 2020

To do it correctly, it's not as simple as just adding a pipeline somewhere. Also, piping "via" Python, as we do, doesn't mean Python actually processes all the data. (It's a real Unix pipe.)

I agree this should be a feature, just like mbuffer support. It first requires extending ExecuteNode in a way that allows adding piped commands (either locally or remotely via ssh).
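
A minimal sketch of the idea (assuming nothing about ExecuteNode's real interface): chaining Popen objects hands one process's stdout file descriptor straight to the next process's stdin, so the kernel moves the bytes and Python never touches them:

import subprocess

# zfs send | pigz, built as two chained processes. Passing
# send.stdout as pigz's stdin gives the child a real pipe fd;
# the stream data never enters the Python process itself.
send = subprocess.Popen(
    ["zfs", "send", "pool1/mainstore@backup-20200428071553"],
    stdout=subprocess.PIPE)
pigz = subprocess.Popen(
    ["pigz"], stdin=send.stdout, stdout=subprocess.PIPE)
send.stdout.close()  # so pigz sees EOF / SIGPIPE correctly
compressed = pigz.stdout  # hand this to ssh / zfs recv downstream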

psy0rz added this to the 3.1 milestone on Apr 30, 2020

psy0rz commented Jul 25, 2020

Depends on #50

psy0rz closed this as completed in 30f30ba on May 15, 2021

psy0rz commented May 15, 2021

I added the same compression options as syncoid. I changed it so zstd uses zstdmt for multithreading.

I would recommend zstd-fast as the default compressor. (It's very fast and compresses pretty well.)
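
For reference, the zstd man page documents zstdmt as equivalent to zstd -T0 (use all detected cores). Extending the table sketch from earlier in this thread, the multithreaded variants might look like this; the levels and names are illustrative, not the shipped configuration:

# Multithreaded zstd variants; zstdmt == zstd -T0 per the man page.
COMPRESSORS.update({
    "zstd-fast": {"compress": ["zstdmt", "--fast=3"], "decompress": ["zstdmt", "-d"]},
    "zstd-slow": {"compress": ["zstdmt", "-19"],      "decompress": ["zstdmt", "-d"]},
})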
