Feature Request: Use pigz #40

Closed
xrobau opened this issue Apr 29, 2020 · 5 comments
xrobau commented Apr 29, 2020

As CPUs are massively faster than anything else we have, we should be compressing data before moving it around. pigz is a multi-threaded implementation of gzip, which scales almost linearly with the number of cores.

I've been running an older hacked version of zfs_backup which uses bash -c on the remote machine, and I was going to do a proper PR for this, but there seems to be some strange issue that I can't figure out.

This is my send command, which works fine:

ssh storage1 '/bin/bash' '-c' '( zfs send' '-L' '-e' '-c' '-D' '-v' '-P' '-p' 'pool1/mainstore@backup-20200428071553' '|pigz)'

I then use shell=True on the local Popen to allow this, since you want to keep Python out of the data path: it would just be copying data in and out of memory and slowing things down. (The second line below is the output of self.debug(encoded_cmd) just before the Popen, on line 413.)

# [Target] Piping input
# [Target] [b'/usr/bin/pigz', b'-d', b'|', b'zfs', b'recv', b'-u', b'-v', b'bigmirror/storage1/pool1/mainstore']
! DATASET FAILED: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
! Exception: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

This looks like the new version is trying to readline().decode('utf-8') the binary data coming from zfs send (or, in this case, zfs send | pigz), and I can't understand why that would be happening.
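
(0x8b is the second byte of the gzip magic number 1f 8b, so this is exactly what you get when compressed output hits a UTF-8 decoder. A minimal standalone reproduction, nothing to do with the project's own code:

import gzip

data = gzip.compress(b"raw zfs stream")
print(data[:2].hex())  # 1f8b -- the gzip magic number
data.decode("utf-8")   # UnicodeDecodeError: 'utf-8' codec can't decode
                       # byte 0x8b in position 1: invalid start byte
)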

Any hints?

@gdevenyi

Rather than hard-coding a specific compression tool, I think we should either make it configurable or provide some pre-configured options, as syncoid does:
https://github.com/jimsalterjrs/sanoid/blob/master/syncoid#L847-L905

P.S. zstd is way better than pigz if available
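
For illustration, such a table could be as small as a dict mapping a name to compress/decompress argv lists; the names and flag choices below are plausible defaults, not syncoid's or this project's actual configuration:

# Sketch of a configurable compression table: each entry gives the
# command to run on the sending and on the receiving side of the pipe.
COMPRESSORS = {
    "gzip": {"compress": ["gzip", "-3"],       "decompress": ["gzip", "-d"]},
    "pigz": {"compress": ["pigz", "-3"],       "decompress": ["pigz", "-d"]},
    "zstd": {"compress": ["zstd", "-c", "-3"], "decompress": ["zstd", "-dc"]},
    "xz":   {"compress": ["xz"],               "decompress": ["xz", "-d"]},
}

def compression_args(name, side):
    # side is "compress" on the sender, "decompress" on the receiver
    return COMPRESSORS[name][side]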


xrobau commented Apr 30, 2020

Agreed, but pigz is a simple lowest common denominator that was going to be the start of the PR. Until I can understand why it's trying to readline() the gzipped data, though, I'm confused 8-\


psy0rz commented Apr 30, 2020

To do it correctly, it's not as simple as just adding a pipeline somewhere. Also, piping "via" Python, as we do, doesn't mean Python actually processes all the data. (It's a real Unix pipe.)

I agree this should be a feature, just like mbuffer support. It first requires extending ExecuteNode in a way that allows adding piped commands (either locally or remotely via ssh).
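
A minimal sketch of the idea (assuming nothing about ExecuteNode's real interface): chaining Popen objects hands one process's stdout file descriptor straight to the next process's stdin, so the kernel moves the bytes and Python never touches them:

import subprocess

# zfs send | pigz, built as two chained processes. Passing
# send.stdout as pigz's stdin gives the child a real pipe fd;
# the stream data never enters the Python process itself.
send = subprocess.Popen(
    ["zfs", "send", "pool1/mainstore@backup-20200428071553"],
    stdout=subprocess.PIPE)
pigz = subprocess.Popen(
    ["pigz"], stdin=send.stdout, stdout=subprocess.PIPE)
send.stdout.close()  # so pigz sees EOF / SIGPIPE correctly
compressed = pigz.stdout  # hand this to ssh / zfs recv downstream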

psy0rz added this to the 3.1 milestone on Apr 30, 2020

psy0rz commented Jul 25, 2020

Depends on #50

psy0rz closed this as completed in 30f30ba on May 15, 2021

psy0rz commented May 15, 2021

I added the same compression options as syncoid. I changed it so zstd uses zstdmt for multithreading.

I would recommend zstd-fast as the default compressor. (It's very fast and compresses pretty well.)
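
For reference, the zstd man page documents zstdmt as equivalent to zstd -T0 (use all detected cores). Extending the table sketch from earlier in this thread, the multithreaded variants might look like this; the levels and names are illustrative, not the shipped configuration:

# Multithreaded zstd variants; zstdmt == zstd -T0 per the man page.
COMPRESSORS.update({
    "zstd-fast": {"compress": ["zstdmt", "--fast=3"], "decompress": ["zstdmt", "-d"]},
    "zstd-slow": {"compress": ["zstdmt", "-19"],      "decompress": ["zstdmt", "-d"]},
})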
