-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
proposal: io/fs.Hash and Hasher #43076
Comments
On further consideration, crypto.Hash is better than hash.Hash. But that would almost certainly require not putting it into io/fs. |
If you meant Also, I think we should cater for a "just give me whatever hash you happen to have" use; e.g. ETags don't care. |
I was thinking there was a way to look at a hash.Hash and figure out that it was an MD5 hasher, but I think I was thinking of crypto.Hash. The fact that cypto.Hash can't be extended outside stdlib is bad, yes. I guess the question is whether just having any hash is good enough or users will care to be able to ensure that the hash is a particular one. https://godoc.org/gocloud.dev/blob#Attributes is probably what should be used as prior art and copied:
|
Some alternatives:
|
So what if the proposal is instead type ReadETagFS interface {
fs.FS
ReadETag(path string) (string, error)
} and |
I wouldn't tie
(And then error can be whatever I'm returning I think the use case for asking for a specific hash is valid, but that can be a separate concern. Example: To upload to S3, I can ensure integrity by providing an MD5 as |
Requirements for an ideal "do you have a shortcut for my custom hash" extension interface:
Note that there's no guarantee that the returned hash.Hash has the same underlying type as e.g. Note that the Example simple use:
Example advanced use:
(That would need an extra non-stdlib API that https://godoc.org/lukechampine.com/blake3 doesn't provide, to construct a keyed hasher from its marshaled internal state.) |
And the alternative for
I'd still document some base assumptions, like minimum 8 byte length, in the |
Personally, I don't like the idea of identifying the hash algorithm via a string. It's too easy to mistype and there is no compiler-check. I think a better solution would be an exported constant, so you'd pass However, really, I don't know if it even makes sense to pass a hash algorithm. From what I can tell, a So, personally, I don't really think it makes a lot of sense to pass a hash function of any sorts to the FS. At least not as part of a generalized filesystem API. For the ETag use-case, an optional interface like @tv42's |
Yes, to use a precomputed hash, a hash has to be precomputed. I'm not sure what alternative there is. To continue the upload to S3 example: not having a prehashed MD5 just means you don't get end-to-end protection against corruption. The source of the data may have precomputed multiple hashes, so it's not just "filesystem itself chooses the Algorithm". Consider e.g. serving HTTP with assets in a cloud object store. Object stores today might already store MD5 and SHA256. And moving forward, they are likely to deprecate MD5 and add something else. Plus one can annotate the objects with extra metadata, storing the hash du jour there even if the object store itself doesn't understand it. Yes there is no magic to transmogrify MD5 into a SHA256, you only get end-to-end guarantees or save work if the marketplace of ideas settles on a small handful of popular hashes. I'd hate to end with a design that rules out the possibility of using much faster hashes like Blake3, when all steps in the processing chain agree. Consider a content management system with assets in an object store, and adding Subresource Integrity to links based on what hashes are available: https://w3c.github.io/webappsec-subresource-integrity/#agility
All of this can be done without using Writing the above raised a new thought: The caller might want to enumerate what hashes the pathname has precomputed. Not sure how to formulate that into an API well, at this time. The SRI use case needs to map them to standardized name prefixes anyway, so it has a list of what it knows how to handle. So maybe we flip the API around: let the filesystem tell the caller what hashes it has precomputed:
Example use:
Does that please you more? Now the FS just says what it knows, yet it's still extensible the same way. The biggest downside of that is that now the FS has to construct the full map, even if caller only cares about just SHA256. ETag use case is a little awkward with a map, it can't just pick first thing
(EDIT: that API can be minimized a little by removing
|
Then we disagree. I think Also note, that while the interface doesn't guarantee the hash, there is nothing preventing an
where And yes. I am fully aware that even that still leaves some cases open (e.g. it doesn't allow a filesystem to gradually migrate hash algorithms file-by-file). But at some point, we have to ask ourselves if the extra use-cases we cover really justify the extra complexity we add to the API. As a person who primarily consumes Go APIs and doesn't provide them, I can say that I tend to pass on APIs that try to solve all problems. Because that tends to make it very hard to solve my problem. |
It's worth taking a step back and asking what the problem is and why it needs to be solved. I see the problem as "some filesystems have precomputed hashes but http.FileSystem cannot use those hashes for setting an ETag". (There's another question of file integrity checks, but fs.FS doesn't allow writing files, so I don't think that's really relevant now.) As a side note, an interesting question is whether there would be some way of exposing e.g. ZFS integrity hashes from the on disk filesystem. Okay, so then there should be some way of asking whether for a given path, a hash exists and secondarily if the hash is the right one for a given purpose. How about something like:
Do we need anything more complex than that? |
@carlmjohnson I think that's more complicated than needed. But is that the only use-case? It certainly is the one I'd care about, personally. But I might be biased by my own experience. I agree that talking about the actual concrete use-cases we want to solve should come first, though, not the solution. And FTR, even "add an optional |
@carlmjohnson That fixes the FS to providing just one hash algorithm. Also, I'm not really happy with communicating the hash algorithm by string: 1) there are no standard names for hashes, it's "SHA-384" in one place and "sha384" in another, so anyone communicating the hash in some protocol needs a mapping anyway; the returned value can implement Stringer for human output purposes 2) forcing a parameterized hash description to a string means next you're parsing that string, and that's just silly within the same program. @Merovius I like the idea of returning an interface, because that allows modern sponge-design hashes to provide an underlying type that implements arbitrary-length output. Modern hashes are all migrating that way: https://godoc.org/golang.org/x/crypto/sha3#ShakeHash https://godoc.org/lukechampine.com/blake3#Hasher.XOF Beyond those three concerns, please do come up with a simpler API. I've tried. Here's one more attempt, taking inspiration from
Example use:
|
Assuming those concerns, there might not be. Which is why, to repeat myself, I think there should be some discussion of what we actually need. The only real tangible use-case I have a grasp on, right now, is ETags (and similar) and for those, I'd argue we don't need arbitrary length hash outputs - any collision resistant hash with at least, say, 128 bit will be sufficient. And we'd also wouldn't need to introduce all the complexity of negotiating hash algorithms. I don't see the point arguing about the best APIs, if we don't agree on what the requirements are first. Once that has happened, the API should come pretty naturally. |
If we just want to solve ETag "give me something that changes if the file content changes, I don't care what", then #43076 (comment) is all that's needed. That means we will not have a standard solution that can cope with e.g. Subresource Integrity. It's all trade-offs. |
That would be an example of a concrete use-case to discuss, thank you. I agree that this would require you to know what the hash is and limit it to a specific set of options. |
FWIW, @benbjohnson just dropped this: https://github.com/benbjohnson/hashfs |
Closing as a duplicate of #43223. We can keep discussing there. |
Background:
Proposal:
Add a func to io/fs to support getting the hash of file.
Perhaps this?
If there's a dependency import issue, maybe put it into package hash.
Ref. #41191 (comment)
The text was updated successfully, but these errors were encountered: