Determining input equivalence with File checksums and Directory listings #472
Comments
This is a great idea @jdidion and is something I have been wondering about myself. One quick question I have here: would you consider exposing some of those properties within WDL proper? A File might only ever be assigned with a URL, but once defined, should it have attributes (checksum, basename, maybe even size)?
Well, remember that we already have a
You are correct (my late-night brain jumped a bit too far down the OOP path).
If we are going for checksums, might I jump in and prevent us from using SHA256? May I suggest using XXHash? It is extremely fast. I already implemented XXHash in Cromwell. As Java bindings and Python bindings are available, it is really no strain on engine developers. QED (benchmarks with hyperfine):
xxh128sum is 12x faster than md5sum and 30x faster than sha256sum. I think the reason sha512sum is a lot faster than sha256sum is hardware optimizations. Also, 64-bit and 128-bit hashes can be represented as 16-char and 32-char hex strings, which are much easier to type than 64-char hex strings. (I prefer the 16-char ones: much easier to check for typos/copy-paste errors!)
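To make the hex-string length point concrete, here is a minimal sketch of streaming checksum computation. It uses only Python's standard `hashlib`, which does not include XXHash, so the algorithms shown are stand-ins; the helper names `file_checksum` and `demo` and the chunk size are illustrative assumptions, not part of any engine.

```python
import hashlib
import os
import tempfile


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large inputs never sit fully in memory.

    Hypothetical helper for illustration; an engine could swap the hashlib
    object for an XXHash binding with the same update/hexdigest interface.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo():
    """Write a small temporary file and report the hex digest length per algorithm."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello wdl\n")
        path = tmp.name
    try:
        return {name: len(file_checksum(path, name)) for name in ("md5", "sha256", "sha512")}
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

A 128-bit digest (md5 here) comes out as 32 hex characters and a 256-bit digest as 64, which is the typing and copy-paste difference described above.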
Hi all, just a comment here, I'd consider generalizing/weakening this to say that the File/Directory representation may be a JSON object with a
@mlin Good point. Checksum shouldn't be required. But we can make a suggestion to use checksums or some other means of determining file equality. And to Ruben's point, we can suggest checksum algorithms, but not require that any specific algorithm be used.
Partially addressed in WDL 1.2. I think the open question is whether to make recommendations in the spec around checksums.
A shortcoming of the File and Directory data types is the inability to guarantee the equivalence of inputs, which breaks reproducibility and makes the implementation of call caching (which we call "job reuse" on DNAnexus) difficult. Say I have a directory foo in my cloud storage that contains a file named baz.

I run a job with a Directory-type input and provide foo as the value. Now I add a new file blorf to the foo directory, and I replace baz with a new file of the same name but with different contents. I run the job again. Given just the directory name, how does my implementation know whether the contents of that directory have remained unchanged since the first time I ran the job?

I propose to formally define two alternative JSON formats for the File and Directory types in the standard input/output formats. Rather than reinvent the wheel, we will just borrow the terminology from CWL.

In the simple form, a File/Directory value is just a string, typically a local path or a URI. The object forms have the following fields:

File
type: Always "File"; optional at the top level but required within directory listings.
location: The file location. This is equivalent to the value in the simple form. May be absent if the file is within a listing, as long as basename is specified.
basename: The name of the file relative to the containing directory. If the basename differs from the actual file name at the given location, the file must be localized with the given basename.
checksum: A checksum of the file using one of the approved algorithms. If specified, the checksum must be verified during localization.

Directory
type: Always "Directory"; optional at the top level but required within directory listings.
location: The directory location. This is equivalent to the value in the simple form. May be absent if the directory is within a listing, as long as basename is specified.
basename: The name of the directory relative to the containing directory. If the basename differs from the actual directory name at the given location, the directory must be localized with the given basename. If location is not specified, then basename and listing are required, and all files/directories in the listing must have a location that is an absolute path or URI.
listing: An array of files/subdirectories within the directory. May be nested to any degree.

Importantly, none of these fields will be exposed within WDL, so the runtime definition of File/Directory won't change.

Draft implementation: https://github.com/openwdl/wdl/tree/472-directory-listing
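
As a rough illustration of how the object form could answer the equivalence question from the foo/baz/blorf scenario, here is a minimal Python sketch. The field names (type, location, basename, checksum, listing) come from the proposal; everything else is an assumption of this sketch, including the helper names `describe`, `fingerprint`, and `demo`, the choice of SHA-256, and the "sha256$<hex>" spelling of the checksum value. It is not taken from the draft implementation.

```python
import hashlib
import json
import os
import tempfile


def describe(path, top_level=True):
    """Build an object-form value for a local file or directory.

    Hypothetical helper: nested entries omit `location` and rely on
    `basename`, which the proposal allows inside a listing.
    """
    name = os.path.basename(os.path.normpath(path))
    if os.path.isdir(path):
        children = sorted(os.listdir(path))  # sorted for a stable listing order
        value = {
            "type": "Directory",
            "basename": name,
            "listing": [describe(os.path.join(path, c), top_level=False) for c in children],
        }
    else:
        # Reads the whole file at once; fine for a sketch, not for large inputs.
        with open(path, "rb") as handle:
            digest = hashlib.sha256(handle.read()).hexdigest()
        value = {"type": "File", "basename": name, "checksum": "sha256$" + digest}
    if top_level:
        value["location"] = os.path.abspath(path)
    return value


def fingerprint(value):
    """Canonical JSON over content-bearing fields only; `location` is ignored."""
    def strip(v):
        out = {k: v[k] for k in ("type", "basename", "checksum") if k in v}
        if "listing" in v:
            out["listing"] = [strip(child) for child in v["listing"]]
        return out
    return json.dumps(strip(value), sort_keys=True)


def demo():
    """Replay the scenario: same directory name, changed contents."""
    with tempfile.TemporaryDirectory() as root:
        foo = os.path.join(root, "foo")
        os.mkdir(foo)
        with open(os.path.join(foo, "baz"), "w") as f:
            f.write("original\n")
        before = fingerprint(describe(foo))
        unchanged = fingerprint(describe(foo))
        # Add blorf and replace baz with different contents under the same name.
        with open(os.path.join(foo, "blorf"), "w") as f:
            f.write("new file\n")
        with open(os.path.join(foo, "baz"), "w") as f:
            f.write("different\n")
        after = fingerprint(describe(foo))
    return before == unchanged, before == after


if __name__ == "__main__":
    print(demo())
```

Comparing fingerprints rather than the location string is what would let a call cache notice that foo changed between runs. Whether the top-level basename should count toward equivalence is a design choice this sketch makes arbitrarily.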