Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding checksum functions #122

Merged
merged 15 commits into from
Mar 25, 2021
Merged

Adding checksum functions #122

merged 15 commits into from
Mar 25, 2021

Conversation

florisvdh
Copy link
Member

@florisvdh florisvdh commented Mar 18, 2021

@cecileherr can you test these, e.g. by running the example or for a Zenodo file?

Adding these functions was more of an adventure than originally expected. After writing, I discovered there is also tools::md5sum() 🤔. However in general I believe it's best to lean on OpenSSL as a standard; tools::md5sum() seems to be hardcoded in R , and from its docs, may not always work the same on all platforms. Also have struggled with silently closing connections until I found the function itself was not the cause, but the example I ran... At least I think that now (please look for warnings).

sha256sum() can be used to generate and store sha256sums as file metadata. I think this implementation is not available in core R packages.

It appears to have been a fix for a missing close() statement
in the examples (after using writeLines()), rather than a
problem with the function.
It was done to avoid R auto-closing dangling connections with
a warning, a bit later.
@florisvdh florisvdh requested a review from cecileherr March 18, 2021 17:43
@florisvdh
Copy link
Member Author

BTW openssl is an imported package, assuming it will eventually be called (indirectly) by most functions.

@florisvdh
Copy link
Member Author

@hansvancalster I just saw that n2khab::download_zenodo() - yep the original one, which is much quicker than zen4R and I prefer the most - uses tools::md5sum(). What's your opinion for this PR?

  • OpenSSL is generally referred as standard implementation for hashing, regardless of language, and openssl, which uses the OpenSSL library, can be used to calculate other file checksums as well in R, such as sha256 which I also would like to store within the package next to md5 (independently of Zenodo), even though we'll by default use md5 to check.
  • also the digest package refers to OpenSSL for most rigorous application
  • tools::md5sum() is a hardcoded, standalone implementation of the md5 algorithm in R itself from 2001. Its docs say on Windows the binary mode is used and text mode on other platforms, but as I understand from here, for *nix systems there should be no difference between the binary and text mode. Indeed, we had no problems using download_zenodo() on various systems. Still, reading the docs of tools::md5sum() feels a bit like the function is a tailor-made approach.

Now, I was intending to suggest switching download_zenodo() to using openssl::md5() as well (through n2khab::md5sum()), for the sake of conformity. However I now see tools::md5sum() is a tiny bit speedier, for a 1.2 GB file (see below). That keeps me hesitating to use n2khab::md5sum() so I'm interested to hear you.

Note, the reason for not using openssl::md5() or openssl::sha256() directly is because those work on a single connection object, not a vector of file paths, and also because their output format is a little different. These things are tackled by n2khab in this PR.

soilmapdbf <- file.path(n2khab::fileman_up("n2khab_data"),
                        "10_raw/soilmap/soilmap.dbf")
invisible(n2khab::md5sum(soilmapdbf)) # load in memory

system.time(openssl::md5(file(soilmapdbf)))
#>    user  system elapsed 
#>   2.885   0.292   3.180
system.time(openssl::md5(file(soilmapdbf)))
#>    user  system elapsed 
#>   2.699   0.236   2.935

system.time(n2khab::md5sum(soilmapdbf))
#>    user  system elapsed 
#>   2.745   0.240   2.986
system.time(tools::md5sum(soilmapdbf))
#>    user  system elapsed 
#>   2.516   0.220   2.737

system.time(n2khab::md5sum(soilmapdbf))
#>    user  system elapsed 
#>   2.668   0.252   2.921
system.time(tools::md5sum(soilmapdbf))
#>    user  system elapsed 
#>   2.555   0.199   2.755

system.time(n2khab::md5sum(soilmapdbf))
#>    user  system elapsed 
#>   2.651   0.315   2.967
system.time(tools::md5sum(soilmapdbf))
#>    user  system elapsed 
#>   2.513   0.256   2.769

system.time(openssl::md5(file(soilmapdbf)))
#>    user  system elapsed 
#>   2.693   0.312   3.005
system.time(openssl::md5(file(soilmapdbf)))
#>    user  system elapsed 
#>   2.769   0.244   3.013

Created on 2021-03-19 by the reprex package (v1.0.0)

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.4 (2021-02-15)
#>  os       Linux Mint 20               
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language nl_BE:nl                    
#>  collate  nl_BE.UTF-8                 
#>  ctype    nl_BE.UTF-8                 
#>  tz       Europe/Brussels             
#>  date     2021-03-19                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date       lib source                       
#>  askpass       1.1        2019-01-13 [1] CRAN (R 4.0.2)               
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.0.2)               
#>  class         7.3-18     2021-01-24 [4] CRAN (R 4.0.3)               
#>  classInt      0.4-3      2020-04-07 [1] CRAN (R 4.0.2)               
#>  cli           2.3.0      2021-01-31 [1] CRAN (R 4.0.3)               
#>  crayon        1.4.1      2021-02-08 [1] CRAN (R 4.0.3)               
#>  DBI           1.1.1      2021-01-15 [1] CRAN (R 4.0.3)               
#>  digest        0.6.27     2020-10-24 [1] CRAN (R 4.0.3)               
#>  dplyr         1.0.4      2021-02-02 [1] CRAN (R 4.0.3)               
#>  e1071         1.7-4      2020-10-14 [1] CRAN (R 4.0.3)               
#>  ellipsis      0.3.1      2020-05-15 [1] CRAN (R 4.0.2)               
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.0.2)               
#>  forcats       0.5.1      2021-01-27 [1] CRAN (R 4.0.3)               
#>  fs            1.5.0      2020-07-31 [1] CRAN (R 4.0.2)               
#>  generics      0.1.0      2020-10-31 [1] CRAN (R 4.0.3)               
#>  git2r         0.28.0     2021-01-10 [1] CRAN (R 4.0.3)               
#>  git2rdata     0.3.1      2021-01-21 [1] CRAN (R 4.0.3)               
#>  glue          1.4.2      2020-08-27 [1] CRAN (R 4.0.2)               
#>  highr         0.8        2019-03-20 [1] CRAN (R 4.0.2)               
#>  htmltools     0.5.1.1    2021-01-22 [1] CRAN (R 4.0.3)               
#>  KernSmooth    2.23-18    2020-10-29 [4] CRAN (R 4.0.3)               
#>  knitr         1.31       2021-01-27 [1] CRAN (R 4.0.3)               
#>  lifecycle     1.0.0      2021-02-15 [1] CRAN (R 4.0.4)               
#>  magrittr      2.0.1      2020-11-17 [1] CRAN (R 4.0.3)               
#>  n2khab        0.4.0.9000 2021-03-19 [1] Github (inbo/n2khab@395e504) 
#>  openssl       1.4.3      2020-09-18 [1] CRAN (R 4.0.2)               
#>  pillar        1.4.7      2020-11-20 [1] CRAN (R 4.0.3)               
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.0.2)               
#>  plyr          1.8.6      2020-03-03 [1] CRAN (R 4.0.2)               
#>  purrr         0.3.4      2020-04-17 [1] CRAN (R 4.0.2)               
#>  R6            2.5.0      2020-10-28 [1] CRAN (R 4.0.3)               
#>  Rcpp          1.0.6      2021-01-15 [1] CRAN (R 4.0.3)               
#>  reprex        1.0.0      2021-01-27 [1] CRAN (R 4.0.3)               
#>  rlang         0.4.10     2020-12-30 [1] CRAN (R 4.0.3)               
#>  rmarkdown     2.6        2020-12-14 [1] CRAN (R 4.0.3)               
#>  rprojroot     2.0.2      2020-11-15 [1] CRAN (R 4.0.3)               
#>  rstudioapi    0.13       2020-11-12 [1] CRAN (R 4.0.3)               
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.2)               
#>  sf            0.9-8      2021-03-04 [1] Github (florisvdh/sf@34434d1)
#>  stringi       1.5.3      2020-09-09 [1] CRAN (R 4.0.2)               
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.0.2)               
#>  tibble        3.0.6      2021-01-29 [1] CRAN (R 4.0.3)               
#>  tidyr         1.1.2      2020-08-27 [1] CRAN (R 4.0.2)               
#>  tidyselect    1.1.0      2020-05-11 [1] CRAN (R 4.0.2)               
#>  units         0.6-7      2020-06-13 [1] CRAN (R 4.0.2)               
#>  vctrs         0.3.6      2020-12-17 [1] CRAN (R 4.0.3)               
#>  withr         2.4.1      2021-01-26 [1] CRAN (R 4.0.3)               
#>  xfun          0.21       2021-02-10 [1] CRAN (R 4.0.4)               
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.2)               
#> 
#> [1] /home/floris/lib/R/library
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library

@florisvdh
Copy link
Member Author

florisvdh commented Mar 19, 2021

One way could be to allocate tools::md5sum to n2khab::md5sum in order to provide direct access to the function, and leave n2khab::sha256sum as is. And because n2khab functions will call the former, drop openssl as a dependency - move it to Suggests (just like tools, but anyone has tools installed with R I think). That leaves us with a mixed approach for checksum calculations in n2khab (tools + openssl), but one package less to load and fastest implementation.

@hansvancalster
Copy link
Contributor

hansvancalster commented Mar 19, 2021

If speed is a concern, you could also consider https://github.com/Cyan4973/xxHash, which is implemented via digest, see https://github.com/ropensci/targets/blob/main/R/utils_digest.R (I always check the targets package - the author of the package choses often the best performing tools).

EDIT: but that algorithm is not available via openssl

@hansvancalster
Copy link
Contributor

hansvancalster commented Mar 19, 2021

Note that openssl::multihash() could be useful:

When using a connection input in the multihash function, the data is only read only once while streaming to multiple hash functions simultaneously. Therefore several hashes are calculated simultanously, without the need to store any data or download it multiple times.

@florisvdh
Copy link
Member Author

florisvdh commented Mar 19, 2021

Thanks @hansvancalster! I'll have another look at xxHash - I've encountered it when exploring this subject, but missed its R implementation. Indeed I tend to choose more established (more widely screened) ones such as md5 / sha256. For verifying larger amounts of files (e.g. the whole n2khab_data folder) it may be worth it though. On the other hand I estimate > 80% time goes to reading from disk (with my HDD; will be much less on your fast SSD), so we cannot improve that part.

openssl::multihash() is indeed interesting, thanks for investigating! We'll only use one hash for checking files though, but it could be useful to generate all hashes at once after creating new files.

@florisvdh
Copy link
Member Author

florisvdh commented Mar 19, 2021

Note that download_zenodo() uses md5 by design: it checks against the md5 metadata from Zenodo. So there remains the question about tools vs. openssl.

@hansvancalster
Copy link
Contributor

Note that download_zenodo() uses md5 by design: it checks against the md5 metadata from Zenodo. So there remains the question about tools vs. openssl.

Indeed; for that case, the speed difference is too minor IMO to decide among them. So I have no strong opinion. Maybe you are right that openssl is a better implementation.

@florisvdh
Copy link
Member Author

OK, it appears that digest is also equipped to hash files, next to R objects (it advertises the capability for R objects) 👍. Didn't catch that - it makes one package more to consider.

See also #112 (comment), where some findings about xxHash have been added.

Some timings:

soilmapdbf <- file.path(n2khab::fileman_up("n2khab_data"),
                        "10_raw/soilmap/soilmap.dbf")
invisible(n2khab::md5sum(soilmapdbf)) # load in memory

system.time(digest::digest(soilmapdbf, algo = "xxhash64", file = TRUE))
#>    user  system elapsed 
#>   0.285   0.208   0.494
system.time(digest::digest(soilmapdbf, algo = "md5", file = TRUE))
#>    user  system elapsed 
#>   3.539   0.280   3.819
system.time(n2khab::md5sum(soilmapdbf))
#>    user  system elapsed 
#>   2.690   0.292   2.983
system.time(tools::md5sum(soilmapdbf))
#>    user  system elapsed 
#>   2.521   0.264   2.786

system.time(digest::digest(soilmapdbf, algo = "xxhash64", file = TRUE))
#>    user  system elapsed 
#>   0.296   0.200   0.496
system.time(digest::digest(soilmapdbf, algo = "md5", file = TRUE))
#>    user  system elapsed 
#>   3.663   0.152   3.816
system.time(n2khab::md5sum(soilmapdbf))
#>    user  system elapsed 
#>   2.715   0.272   2.987
system.time(tools::md5sum(soilmapdbf))
#>    user  system elapsed 
#>   2.521   0.264   2.785

Created on 2021-03-22 by the reprex package (v1.0.0)

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.4 (2021-02-15)
#>  os       Linux Mint 20               
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language nl_BE:nl                    
#>  collate  nl_BE.UTF-8                 
#>  ctype    nl_BE.UTF-8                 
#>  tz       Europe/Brussels             
#>  date     2021-03-22                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date       lib source                       
#>  askpass       1.1        2019-01-13 [1] CRAN (R 4.0.2)               
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.0.2)               
#>  class         7.3-18     2021-01-24 [4] CRAN (R 4.0.3)               
#>  classInt      0.4-3      2020-04-07 [1] CRAN (R 4.0.2)               
#>  cli           2.3.0      2021-01-31 [1] CRAN (R 4.0.3)               
#>  crayon        1.4.1      2021-02-08 [1] CRAN (R 4.0.3)               
#>  DBI           1.1.1      2021-01-15 [1] CRAN (R 4.0.3)               
#>  digest        0.6.27     2020-10-24 [1] CRAN (R 4.0.3)               
#>  dplyr         1.0.4      2021-02-02 [1] CRAN (R 4.0.3)               
#>  e1071         1.7-4      2020-10-14 [1] CRAN (R 4.0.3)               
#>  ellipsis      0.3.1      2020-05-15 [1] CRAN (R 4.0.2)               
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.0.2)               
#>  forcats       0.5.1      2021-01-27 [1] CRAN (R 4.0.3)               
#>  fs            1.5.0      2020-07-31 [1] CRAN (R 4.0.2)               
#>  generics      0.1.0      2020-10-31 [1] CRAN (R 4.0.3)               
#>  git2r         0.28.0     2021-01-10 [1] CRAN (R 4.0.3)               
#>  git2rdata     0.3.1      2021-01-21 [1] CRAN (R 4.0.3)               
#>  glue          1.4.2      2020-08-27 [1] CRAN (R 4.0.2)               
#>  highr         0.8        2019-03-20 [1] CRAN (R 4.0.2)               
#>  htmltools     0.5.1.1    2021-01-22 [1] CRAN (R 4.0.3)               
#>  KernSmooth    2.23-18    2020-10-29 [4] CRAN (R 4.0.3)               
#>  knitr         1.31       2021-01-27 [1] CRAN (R 4.0.3)               
#>  lifecycle     1.0.0      2021-02-15 [1] CRAN (R 4.0.4)               
#>  magrittr      2.0.1      2020-11-17 [1] CRAN (R 4.0.3)               
#>  n2khab        0.4.0.9000 2021-03-19 [1] Github (inbo/n2khab@395e504) 
#>  openssl       1.4.3      2020-09-18 [1] CRAN (R 4.0.2)               
#>  pillar        1.4.7      2020-11-20 [1] CRAN (R 4.0.3)               
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.0.2)               
#>  plyr          1.8.6      2020-03-03 [1] CRAN (R 4.0.2)               
#>  purrr         0.3.4      2020-04-17 [1] CRAN (R 4.0.2)               
#>  R6            2.5.0      2020-10-28 [1] CRAN (R 4.0.3)               
#>  Rcpp          1.0.6      2021-01-15 [1] CRAN (R 4.0.3)               
#>  reprex        1.0.0      2021-01-27 [1] CRAN (R 4.0.3)               
#>  rlang         0.4.10     2020-12-30 [1] CRAN (R 4.0.3)               
#>  rmarkdown     2.6        2020-12-14 [1] CRAN (R 4.0.3)               
#>  rprojroot     2.0.2      2020-11-15 [1] CRAN (R 4.0.3)               
#>  rstudioapi    0.13       2020-11-12 [1] CRAN (R 4.0.3)               
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.2)               
#>  sf            0.9-8      2021-03-04 [1] Github (florisvdh/sf@34434d1)
#>  stringi       1.5.3      2020-09-09 [1] CRAN (R 4.0.2)               
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.0.2)               
#>  tibble        3.0.6      2021-01-29 [1] CRAN (R 4.0.3)               
#>  tidyr         1.1.2      2020-08-27 [1] CRAN (R 4.0.2)               
#>  tidyselect    1.1.0      2020-05-11 [1] CRAN (R 4.0.2)               
#>  units         0.6-7      2020-06-13 [1] CRAN (R 4.0.2)               
#>  vctrs         0.3.6      2020-12-17 [1] CRAN (R 4.0.3)               
#>  withr         2.4.1      2021-01-26 [1] CRAN (R 4.0.3)               
#>  xfun          0.21       2021-02-10 [1] CRAN (R 4.0.4)               
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.2)               
#> 
#> [1] /home/floris/lib/R/library
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library

Sidenote: the md5 hashing by digest is clearly slower than the implementations of tools and openssl.

I am quite attracted by the much higher speed of xxHash. So I might want to take the small risk of using XXH64 for now, over MD5 or SHA256. ('Risk': its hash is less unique, its R implementation is tailor-made - not linked, and the technology is not as well-known). Even though disk reading time still matters more. We can still continue to store the MD5/SHA256 as a backup, and perhaps expose the algorithm choice as a user setting.

In order to maintain future flexibility with reference to hash algorithm choice, I intend to rewrite this PR a bit so that a more generic hash function is provided e.g. with xxh64, md5 and sha256 as options, and one set as default. When using this function throughout n2khab, we can code it such that it just needs a change in its default, to let the whole package switch to using another hash (by default).

@cecileherr
Copy link
Collaborator

Thanks for the useful addition!

I cannot comment on the choice of the package since I am not experienced enough, but I tested the new functions.

I tried with single and multiple files, and with different file types, and md5sum() and sha256sum() seem to work as expected, except for the assert_that_allfiles_exist part.

  • The problem with assert_that_allfiles_exist:

As far as I understand assert_that checks whether a file or a folder exists, but this can be a problem in this case because md5sum() and sha256sum() expect files and will give an error if used with folders.

myfile1 <- file.path(fileman_up("n2khab_data"), "10_raw/watersurfaces/watersurfaces.gpkg")
myfile2 <- file.path(fileman_up("n2khab_data"), "10_raw/habitatstreams") 
# a folder with a shp (would be recognized by read_sf but will not work with md5sum() and sha256sum() )

md5sum(c(myfile1, myfile2))
# Error in open.connection(con, "rb") : cannot open the connection 
# do not trigger assert_that_allfiles_exist (because assert_that_allfiles_exist(myfile2) = TRUE)

A side note: because of this difference in usage between md5sum() and sha256sum() and the files/paths recognized by read_sf in the read_xxx functions (with read_sf you don't need to mention n2khab_data/10_raw/watersurfaces/watersurfaces.shp, n2khab_data/10_raw/watersurfaces will also work), we might have to be more explicit in the defintion of the file argument in the read_xx functions.

  • One more detail:

When I encounter an error (in this case on purpose), the next time I use the function I will get a warning (even though the second try works smoothly). Not really a problem, but it might be intriguing/disturbing for users.

# trigger an error:
myfile <- file.path(fileman_up("n2khab_data"), "10_raw/watersurfaces")
md5sum(myfile)
 # Error in open.connection(con, "rb") : cannot open the connection 

# run for an existing file:
myfile <- file.path(fileman_up("n2khab_data"), "10_raw/watersurfaces/watersurfaces.shp")
md5sum(myfile)
#                 watersurfaces.shp 
# "935ab52ef0dc32dd8b028894c08c3e79"
#Warning message:
#In match.fun(FUN) :
#  closing unused connection 3 (C:\Users\cecile_herr\GitHubRepos\n2khab_data/10_raw/watersurfaces)

@florisvdh
Copy link
Member Author

Thank you for testing @cecileherr!

md5sum() and sha256sum() expect files and will give an error if used with folders

Yes, that's intended behaviour. File hashes are calculated for files only. I'll add a check to error on folders.

When I encounter an error (in this case on purpose), the next time I use the function I will get a warning

Well spotted 👍. I overlooked the case where the function would error because a folder is provided instead of a file. Although the hashing functions are primarily intended for internal usage, I exported them so it's better to harden them as well. Will try to avoid dangling connections (in case of an error) by re-adopting dropped parts of 1460b88.

we might have to be more explicit in the definition of the file argument in the read_xx functions

For shapefiles, we could indeed replace the folder with the *.shp file as default - that should also work for sf and it may appear more logical. Nevertheless the approach cannot cover cases where a multi-layer data source is actually defined by different files (GRTSmh_diffres is in that case) - so for that reason we could as well leave all current default folder arguments as-is. I would rather not invest too much in that, primarily because of the following.

For shapefiles (and things like GRTSmh_diffres) you would need to provide all files as default argument if you also want to use that argument for file hashing. I believe that is not ideal - for now it would unnecessarily complicate things IMHO. It would duplicate the information in https://docs.google.com/spreadsheets/d/1E8ERlfYwP3OjluL8d7_4rR1W34ka4LRCE35JTxf3WMI/edit#gid=2100595853. The latter is intended to end up as an internal table in n2khab, and will already be queried with an (internal) function to know which files (and expected hashes) are linked to which data source versions (the latter also need definition in a local table). Hardcoding those data and doing file integrity checking inside an n2khab function should now be limited to cases where it is rather necessary, as it will become obsolete once package-wide file/version handling is in place. I.e. the case you referred to where one read_xxx() function calls another - it is probably a single case in n2khab.

So, in the end all functions should rather have a file = NULL argument (just kept for backward compatibility). The file integrity check will happen in the context of another generic check function, which all functions can call while passing data source and version as argument to it - something along that approach (and more). We can discuss it more elaborately if you want.

@florisvdh
Copy link
Member Author

md5sum() and sha256sum() expect files and will give an error if used with folders

Should be solved by 0cec424.

Will try to avoid dangling connections (in case of an error) by re-adopting dropped parts of 1460b88.

Update: I think this is not needed. Currently only actual files are accepted, by 0cec424. So the connection error should never happen, unless on rare occasions like I/O errors or something like that (but that would yield more error messages anyway). Do you agree @cecileherr?

@florisvdh
Copy link
Member Author

florisvdh commented Mar 24, 2021

General function checksum() has been added, this is the one to be used in other functions - then the default hash algorithm can be set in the source code of checksum().

Currently defaults to the XXH64 hash function.

library(n2khab)

# creating two different temporary files:
file1 <- tempfile()
file2 <- tempfile()
files <- c(file1, file2)
file.create(files)
#> [1] TRUE TRUE
con <- file(file2)
writeLines("some text", con)
close(con)

# computing alternative checksums:
checksum(files)
#>   filebd3f1b8e63c4   filebd3f116546f0 
#> "ef46db3751d8e999" "b563efb2061ae502"
xxh64sum(files)
#>   filebd3f1b8e63c4   filebd3f116546f0 
#> "ef46db3751d8e999" "b563efb2061ae502"
md5sum(files)
#>                   filebd3f1b8e63c4                   filebd3f116546f0 
#> "d41d8cd98f00b204e9800998ecf8427e" "4d93d51945b88325c213640ef59fc50b"
sha256sum(files)
#>                                                   filebd3f1b8e63c4 
#> "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" 
#>                                                   filebd3f116546f0 
#> "a23e5fdcd7b276bdd81aa1a0b7b963101863dd3f61ff57935f8c5ba462681ea6"

Created on 2021-03-24 by the reprex package (v1.0.0)

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.4 (2021-02-15)
#>  os       Linux Mint 20               
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language nl_BE:nl                    
#>  collate  nl_BE.UTF-8                 
#>  ctype    nl_BE.UTF-8                 
#>  tz       Europe/Brussels             
#>  date     2021-03-24                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date       lib source                       
#>  askpass       1.1        2019-01-13 [1] CRAN (R 4.0.2)               
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.0.2)               
#>  class         7.3-18     2021-01-24 [4] CRAN (R 4.0.3)               
#>  classInt      0.4-3      2020-04-07 [1] CRAN (R 4.0.2)               
#>  cli           2.3.0      2021-01-31 [1] CRAN (R 4.0.3)               
#>  crayon        1.4.1      2021-02-08 [1] CRAN (R 4.0.3)               
#>  DBI           1.1.1      2021-01-15 [1] CRAN (R 4.0.3)               
#>  digest        0.6.27     2020-10-24 [1] CRAN (R 4.0.3)               
#>  dplyr         1.0.4      2021-02-02 [1] CRAN (R 4.0.3)               
#>  e1071         1.7-4      2020-10-14 [1] CRAN (R 4.0.3)               
#>  ellipsis      0.3.1      2020-05-15 [1] CRAN (R 4.0.2)               
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.0.2)               
#>  forcats       0.5.1      2021-01-27 [1] CRAN (R 4.0.3)               
#>  fs            1.5.0      2020-07-31 [1] CRAN (R 4.0.2)               
#>  generics      0.1.0      2020-10-31 [1] CRAN (R 4.0.3)               
#>  git2r         0.28.0     2021-01-10 [1] CRAN (R 4.0.3)               
#>  git2rdata     0.3.1      2021-01-21 [1] CRAN (R 4.0.3)               
#>  glue          1.4.2      2020-08-27 [1] CRAN (R 4.0.2)               
#>  highr         0.8        2019-03-20 [1] CRAN (R 4.0.2)               
#>  htmltools     0.5.1.1    2021-01-22 [1] CRAN (R 4.0.3)               
#>  KernSmooth    2.23-18    2020-10-29 [4] CRAN (R 4.0.3)               
#>  knitr         1.31       2021-01-27 [1] CRAN (R 4.0.3)               
#>  lifecycle     1.0.0      2021-02-15 [1] CRAN (R 4.0.4)               
#>  magrittr      2.0.1      2020-11-17 [1] CRAN (R 4.0.3)               
#>  n2khab      * 0.4.0.9000 2021-03-24 [1] local                        
#>  openssl       1.4.3      2020-09-18 [1] CRAN (R 4.0.2)               
#>  pillar        1.4.7      2020-11-20 [1] CRAN (R 4.0.3)               
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.0.2)               
#>  plyr          1.8.6      2020-03-03 [1] CRAN (R 4.0.2)               
#>  purrr         0.3.4      2020-04-17 [1] CRAN (R 4.0.2)               
#>  R6            2.5.0      2020-10-28 [1] CRAN (R 4.0.3)               
#>  Rcpp          1.0.6      2021-01-15 [1] CRAN (R 4.0.3)               
#>  reprex        1.0.0      2021-01-27 [1] CRAN (R 4.0.3)               
#>  rlang         0.4.10     2020-12-30 [1] CRAN (R 4.0.3)               
#>  rmarkdown     2.6        2020-12-14 [1] CRAN (R 4.0.3)               
#>  rprojroot     2.0.2      2020-11-15 [1] CRAN (R 4.0.3)               
#>  rstudioapi    0.13       2020-11-12 [1] CRAN (R 4.0.3)               
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.2)               
#>  sf            0.9-8      2021-03-04 [1] Github (florisvdh/sf@34434d1)
#>  stringi       1.5.3      2020-09-09 [1] CRAN (R 4.0.2)               
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.0.2)               
#>  tibble        3.0.6      2021-01-29 [1] CRAN (R 4.0.3)               
#>  tidyr         1.1.2      2020-08-27 [1] CRAN (R 4.0.2)               
#>  tidyselect    1.1.0      2020-05-11 [1] CRAN (R 4.0.2)               
#>  units         0.6-7      2020-06-13 [1] CRAN (R 4.0.2)               
#>  vctrs         0.3.6      2020-12-17 [1] CRAN (R 4.0.3)               
#>  withr         2.4.1      2021-01-26 [1] CRAN (R 4.0.3)               
#>  xfun          0.21       2021-02-10 [1] CRAN (R 4.0.4)               
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.2)               
#> 
#> [1] /home/floris/lib/R/library
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library

@cecileherr here are the XXH64 hashes that you'll probably need (MD5 provided as check against reference list / Zenodo):

library(n2khab)
library(magrittr)
hms_2018 <- file.path(n2khab::fileman_up("n2khab_data"),
                      "20_processed/habitatmap_stdized_2018_v2/habitatmap_stdized.gpkg")
hms_2020 <- file.path(n2khab::fileman_up("n2khab_data"),
                      "20_processed/habitatmap_stdized/habitatmap_stdized.gpkg")
files <- c(hms_2018, hms_2020)
md5sum(files) %>% set_names(c(2018, 2020))
#>                               2018                               2020 
#> "7f89d4a6bc0b2080a0510169497259ff" "e2f0bff28016bd525ab652349c555141"
xxh64sum(files) %>% set_names(c(2018, 2020))
#>               2018               2020 
#> "b80f469f33636c8b" "3109c26f0a27a0f3"

Copy link
Collaborator

@cecileherr cecileherr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! 👍 (and for the 5* service: you even provide the md5/xxh64 I need 👌 )
I tested this new version and as far as I can see, it seems to work as expected. Both the openssl options and the digest option run smoothly and the few details I had mentioned in my previous review are solved.

msg = paste0("Only files are accepted; ",
"the following path(s) are directories:\n",
paste0(x[isdir], collapse = "\n")))
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Clear error messages
Tested: this version works as expected

@florisvdh
Copy link
Member Author

Thanks for your review @cecileherr.

@florisvdh florisvdh merged commit eccd527 into dev_0.5.0 Mar 25, 2021
@florisvdh florisvdh deleted the checksums branch March 25, 2021 14:37
@florisvdh florisvdh changed the title Adding functions md5sum() and sha256sum() Adding checksum functions Mar 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants