Determining input equivalence with File checksums and Directory listings #472

Closed
jdidion opened this issue Aug 10, 2021 · 7 comments
Labels
K-request-for-comment (Kind) A request for comment (RFC). L-enhancement (Legacy) An enhancement to the WDL language. Z-specification-change (Metadata) An issue or PR related to a specification change.

@jdidion
Collaborator

jdidion commented Aug 10, 2021

A shortcoming of the File and Directory data types is the inability to guarantee the equivalence of inputs, which breaks reproducibility and makes the implementation of call caching (which we call "job reuse" on DNAnexus) difficult. Say I have the following directory in my cloud storage:

foo
|_bar
|_baz

I run a job with a Directory-type input and provide foo as the value. Now I add a new file blorf to the foo directory, and I replace baz with a new file of the same name but with different contents. I run the job again. Given just the directory name, how does my implementation know whether the contents of that directory have remained unchanged since the first time I ran the job?
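
To make the problem concrete, here is a minimal sketch of one way an engine could detect the change: fold every relative path and every file's bytes into a single digest. The function names `directory_fingerprint` and `demo` are hypothetical, and SHA-256 is only an illustrative choice, not something this proposal requires.

```python
import hashlib
import os
import tempfile


def directory_fingerprint(root):
    """Return a hex digest over the relative paths and contents of all files under root."""
    digest = hashlib.sha256()
    for current, dirs, files in os.walk(root):
        dirs.sort()  # visit subdirectories in a deterministic order
        for name in sorted(files):
            path = os.path.join(current, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            digest.update(rel.encode("utf-8") + b"\0")
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large files are never loaded whole.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            digest.update(b"\0")
    return digest.hexdigest()


def demo():
    """Recreate the foo/bar/baz scenario; return fingerprints before and after the edit."""
    with tempfile.TemporaryDirectory() as tmp:
        foo = os.path.join(tmp, "foo")
        os.mkdir(foo)
        for name in ("bar", "baz"):
            with open(os.path.join(foo, name), "w") as handle:
                handle.write("original " + name)
        before = directory_fingerprint(foo)
        # Add blorf and replace baz with different contents.
        with open(os.path.join(foo, "blorf"), "w") as handle:
            handle.write("new file")
        with open(os.path.join(foo, "baz"), "w") as handle:
            handle.write("changed contents")
        after = directory_fingerprint(foo)
    return before, after
```

The directory name is the same in both runs, but the two fingerprints differ. Note that empty subdirectories do not contribute to this particular digest.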

I propose to formally define two alternative JSON formats for the File and Directory types in the standard input/output formats. Rather than reinvent the wheel, we will just borrow the terminology from CWL.

{
  "file1": "/path/to/file",
  "dir1": "/path/to/dir",
  "dir2": {
    "location": "/path/to/dir",
    "listing": [
      {
        "type": "File",
        "basename": "foo.txt",
        "checksum": "sha256:ABC123"
      },
      {
        "type": "Directory",
        "basename": "bar",
        "listing": [
          {
            "type": "File",
            "basename": "baz.txt",
            "checksum": "md5:WTFBBQ42"
          }
        ]
      }
    ]
  },
  "dir3": {
    "basename": "fakedir",
    "listing": [
      {
        "type": "File",
        "location": "/path/to/foo.txt",
        "basename": "bar.txt",
        "checksum": "sha256:ABC123"
      }
    ]
  }
}

In the simple form, a File/Directory value is just a string - typically a local path or a URI. The object forms have the following fields:

  • File
    • type: Always "File"; optional at the top-level but required within directory listings
    • location: The file location - this is equivalent to the value in the simple form. May be absent if the file is within a listing as long as basename is specified.
    • basename: The name of the file relative to the containing directory. If the basename differs from the actual file name at the given location, the file must be localized with the given basename.
    • checksum: A checksum of the file using one of the approved algorithms. If specified, the checksum must be verified during localization.
  • Directory
    • type: Always "Directory"; optional at the top-level but required within directory listings
    • location: The directory location - this is equivalent to the value in the simple form. May be absent if the directory is within a listing as long as basename is specified.
    • basename: The name of the directory relative to the containing directory. If the basename differs from the actual directory name at the given location, the directory must be localized with the given basename. If location is not specified, then basename and listing are required, and all files/directories in the listing must have a location that is an absolute path or URI.
    • listing: An array of files/subdirectories within the directory. May be nested to any degree.
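
As an illustration of the object form, the following sketch walks a local directory and emits a value shaped like the `dir2` example: `location` at the top level, and `type`, `basename`, and (for files) `checksum` inside the listing. The helper names `file_checksum` and `describe_directory` are hypothetical, and the `algorithm:hexdigest` string format is an assumption taken from the example values above.

```python
import hashlib
import os
import tempfile


def file_checksum(path, algorithm="sha256"):
    """Return 'algorithm:hexdigest' for the file at path, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return algorithm + ":" + digest.hexdigest()


def describe_directory(path, top_level=True):
    """Build the object form of a Directory value, recursing into subdirectories."""
    listing = []
    for name in sorted(os.listdir(path)):
        child = os.path.join(path, name)
        if os.path.isdir(child):
            listing.append(describe_directory(child, top_level=False))
        else:
            listing.append(
                {"type": "File", "basename": name, "checksum": file_checksum(child)}
            )
    result = {"type": "Directory"}
    if top_level:
        result["location"] = path  # equivalent to the simple string form
    else:
        result["basename"] = os.path.basename(path)
    result["listing"] = listing
    return result


def demo():
    """Describe a small temporary directory containing foo.txt and bar/baz.txt."""
    with tempfile.TemporaryDirectory() as tmp:
        os.mkdir(os.path.join(tmp, "bar"))
        with open(os.path.join(tmp, "foo.txt"), "wb") as handle:
            handle.write(b"hello")
        with open(os.path.join(tmp, "bar", "baz.txt"), "wb") as handle:
            handle.write(b"world")
        return describe_directory(tmp)
```

Passing the result to `json.dumps` gives a value that could be placed in an inputs file under this proposal.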

Importantly, none of these fields will be exposed within WDL, so the runtime definition of File/Directory won't change.
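
The two localization rules for File (use the given basename, and verify the checksum if one is present) could look roughly like this on a local filesystem. This is a sketch only: `localize_file` is a hypothetical helper, it handles local paths rather than URIs, and it assumes the `algorithm:hexdigest` checksum format with an algorithm name that `hashlib` recognizes.

```python
import hashlib
import os
import shutil
import tempfile


def localize_file(entry, dest_dir):
    """Copy entry['location'] into dest_dir under entry['basename'], verifying any checksum."""
    source = entry["location"]
    basename = entry.get("basename") or os.path.basename(source)
    target = os.path.join(dest_dir, basename)
    shutil.copyfile(source, target)
    checksum = entry.get("checksum")
    if checksum is not None:
        algorithm, _, expected = checksum.partition(":")
        digest = hashlib.new(algorithm)
        with open(target, "rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest().lower() != expected.lower():
            os.remove(target)  # do not leave an unverified file behind
            raise ValueError("checksum mismatch for " + source)
    return target


def demo():
    """Localize foo.txt as bar.txt, then show that a wrong checksum is rejected."""
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "foo.txt")
        with open(source, "wb") as handle:
            handle.write(b"hello")
        dest = os.path.join(tmp, "work")
        os.mkdir(dest)
        good = {
            "type": "File",
            "location": source,
            "basename": "bar.txt",
            "checksum": "sha256:" + hashlib.sha256(b"hello").hexdigest(),
        }
        localized_name = os.path.basename(localize_file(good, dest))
        bad = dict(good, checksum="sha256:" + "0" * 64)
        try:
            localize_file(bad, dest)
            rejected = False
        except ValueError:
            rejected = True
    return localized_name, rejected
```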

Draft implementation: https://github.com/openwdl/wdl/tree/472-directory-listing

@jdidion jdidion added L-enhancement (Legacy) An enhancement to the WDL language. Z-specification-change (Metadata) An issue or PR related to a specification change. K-request-for-comment (Kind) A request for comment (RFC). labels Aug 10, 2021
@jdidion jdidion added this to the v2.0 milestone Aug 10, 2021
@patmagee
Member

This is a great idea @jdidion and is something I have been wondering about myself. One quick question: would you consider exposing some of those properties within WDL proper? A File might only ever be assigned a URL, but once defined, should it have attributes (checksum, basename, maybe even size)?

@jdidion
Collaborator Author

jdidion commented Aug 11, 2021

Well, remember that we already have a size function. If we want to expose any other attributes (which I don't think is necessary), we should also do so with functions.

@patmagee
Member

You are correct. (My late-night brain jumped a bit too far down the OOP path.)

@rhpvorderman
Contributor

rhpvorderman commented Oct 8, 2021

If we are going for checksums might I jump in and prevent us from using SHA256?
That is a cryptographic hash. Cryptographic hashes are built to resist deliberate attacks, which makes them comparatively slow. That is overkill for a file. We just want something like a cyclic redundancy check to make sure the file is the same. In bioinformatics we often use files of 50 GB or more, so it makes quite a difference whether a fast or slow hash is used.

May I suggest using XXHash? It is extremely fast. I have already implemented XXHash in Cromwell. Since Java and Python bindings are available, it is really no strain on engine developers.

QED (benchmarks with hyperfine):

$ du -h big2.fastq.gz
657M	big2.fastq.gz

Benchmark #1: md5sum big2.fastq.gz
  Time (mean ± σ):     939.8 ms ±  12.8 ms    [User: 891.8 ms, System: 48.0 ms]
  Range (min … max):   919.2 ms … 958.1 ms    10 runs

Benchmark #1: sha1sum big2.fastq.gz
  Time (mean ± σ):     925.2 ms ±  10.5 ms    [User: 878.7 ms, System: 46.4 ms]
  Range (min … max):   903.9 ms … 941.8 ms    10 runs
 
Benchmark #1: sha256sum big2.fastq.gz
  Time (mean ± σ):      2.322 s ±  0.024 s    [User: 2.273 s, System: 0.049 s]
  Range (min … max):    2.295 s …  2.365 s    10 runs

Benchmark #1: sha384sum big2.fastq.gz
  Time (mean ± σ):      1.596 s ±  0.008 s    [User: 1.552 s, System: 0.044 s]
  Range (min … max):    1.587 s …  1.614 s    10 runs

Benchmark #1: sha512sum big2.fastq.gz
  Time (mean ± σ):      1.611 s ±  0.014 s    [User: 1.573 s, System: 0.038 s]
  Range (min … max):    1.582 s …  1.632 s    10 runs

Benchmark #1: xxh32sum big2.fastq.gz
  Time (mean ± σ):     138.5 ms ±  11.1 ms    [User: 91.1 ms, System: 47.4 ms]
  Range (min … max):   127.2 ms … 150.0 ms    10 runs

Benchmark #1: xxh64sum big2.fastq.gz
  Time (mean ± σ):      99.1 ms ±   9.4 ms    [User: 47.8 ms, System: 51.3 ms]
  Range (min … max):    84.6 ms … 106.4 ms    10 runs

Benchmark #1: xxh128sum big2.fastq.gz
  Time (mean ± σ):      84.9 ms ±   8.6 ms    [User: 37.2 ms, System: 47.7 ms]
  Range (min … max):    72.6 ms …  91.5 ms    10 runs

Going by the mean times, xxh128sum is about 11x faster than md5sum and about 27x faster than sha256sum. I think sha512sum is a lot faster than sha256sum because of hardware optimizations.
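
For anyone who wants to recompute the ratios, this short sketch divides the mean times reported by hyperfine above. The dictionary `MEAN_MS` and the function `speedup` are names made up for this example; the numbers are copied from the benchmark output.

```python
# Mean wall-clock times in milliseconds, copied from the hyperfine runs above.
MEAN_MS = {
    "md5sum": 939.8,
    "sha1sum": 925.2,
    "sha256sum": 2322.0,
    "sha384sum": 1596.0,
    "sha512sum": 1611.0,
    "xxh32sum": 138.5,
    "xxh64sum": 99.1,
    "xxh128sum": 84.9,
}


def speedup(baseline, candidate):
    """Return how many times faster candidate is than baseline, by mean time."""
    return MEAN_MS[baseline] / MEAN_MS[candidate]


if __name__ == "__main__":
    for tool in ("md5sum", "sha256sum", "sha512sum"):
        print(tool, "vs xxh128sum:", round(speedup(tool, "xxh128sum"), 1))
```

By these means, xxh128sum comes out roughly 11.1x faster than md5sum and 27.3x faster than sha256sum, and sha512sum is about 1.4x faster than sha256sum.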

Also, 64-bit and 128-bit hashes can be represented as 16-char and 32-char hex strings, which are much easier to type than 64-char hex strings. (I prefer the 16-char ones: much easier to check for typos and copy-paste errors!)
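
The string lengths follow from each hex character encoding 4 bits. A quick check using only the standard library (`hex_length` is a hypothetical helper; MD5 and SHA-256 stand in here as 128-bit and 256-bit examples because they ship with Python):

```python
import hashlib


def hex_length(bits):
    """Number of hex characters needed to write a digest of the given bit width."""
    return bits // 4


if __name__ == "__main__":
    for bits in (64, 128, 256):
        print(bits, "bits ->", hex_length(bits), "hex chars")
    # Real digests agree with the arithmetic: MD5 is 128 bits, SHA-256 is 256 bits.
    print(len(hashlib.md5(b"").hexdigest()), len(hashlib.sha256(b"").hexdigest()))
```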

@mlin
Member

mlin commented Mar 22, 2023

Hi all, just a comment here, I'd consider generalizing/weakening this to say that the File/Directory representation may be a JSON object with a location key and whatever else the engine may wish to include or interpret. Going beyond that may be too prescriptive of implementation details. For example, miniwdl's call cache just uses the filesystem mtimes instead of digests.
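
As a rough illustration of the mtime-based alternative (this is a sketch with hypothetical names, not miniwdl's actual code), a cache key can be built from each file's size and modification time without reading any file contents:

```python
import os
import tempfile


def stat_signature(path):
    """Return a tuple of (relative path, size, mtime_ns) entries for a file or directory."""
    if os.path.isdir(path):
        entries = []
        for current, dirs, files in os.walk(path):
            dirs.sort()  # deterministic traversal order
            for name in sorted(files):
                full = os.path.join(current, name)
                info = os.stat(full)
                entries.append(
                    (os.path.relpath(full, path), info.st_size, info.st_mtime_ns)
                )
        return tuple(entries)
    info = os.stat(path)
    return ((os.path.basename(path), info.st_size, info.st_mtime_ns),)


def demo():
    """Show that touching a file changes the signature even if its bytes do not change."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "baz")
        with open(path, "w") as handle:
            handle.write("same bytes")
        before = stat_signature(path)
        info = os.stat(path)
        # Simulate a later rewrite by moving the mtime forward ten seconds.
        os.utime(path, ns=(info.st_atime_ns, info.st_mtime_ns + 10_000_000_000))
        after = stat_signature(path)
    return before, after
```

The trade-off is visible in `demo`: the check is cheap, but a file that is rewritten with identical bytes still looks different, so this approach errs toward re-running a job rather than reusing a cached result.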

@jdidion jdidion moved this to Todo in WDL v1.2 Mar 23, 2023
@jdidion jdidion added this to WDL v1.2 Mar 23, 2023
@jdidion jdidion modified the milestones: v2.0, 1.2 Mar 23, 2023
@jdidion
Collaborator Author

jdidion commented Mar 29, 2023

@mlin Good point. Checksums shouldn't be required. But we can suggest using checksums or some other means of determining file equality. And to Ruben's point, we can suggest checksum algorithms, but not require that any specific algorithm be used.

@jdidion jdidion self-assigned this Mar 29, 2023
@jdidion jdidion moved this from Todo to In Progress in WDL v1.2 Mar 29, 2023
@jdidion jdidion moved this from In Progress to Drafted in WDL v1.2 Feb 1, 2024
@jdidion jdidion removed their assignment Mar 28, 2024
@vsmalladi vsmalladi moved this from Drafted to In Review in WDL v1.2 May 11, 2024
@vsmalladi vsmalladi self-assigned this May 11, 2024
@jdidion
Collaborator Author

jdidion commented May 15, 2024

Partially addressed in WDL 1.2. I think the open question is whether to make recommendations in the spec around checksums.

@jdidion jdidion modified the milestones: 1.2, 1.3 May 15, 2024
@jdidion jdidion closed this as completed May 16, 2024
@github-project-automation github-project-automation bot moved this from In Review to Done in WDL v1.2 May 16, 2024
Status: Done

5 participants