Adding checksum functions #122
It appears to have been a fix for a missing close() statement in the examples (after using writeLines()), rather than a problem with the function. It was done to avoid R auto-closing dangling connections with a warning, a bit later.
BTW |
@hansvancalster I just saw that
Now, I was intending to suggest switching
Note, the reason for not using
soilmapdbf <- file.path(n2khab::fileman_up("n2khab_data"),
"10_raw/soilmap/soilmap.dbf")
invisible(n2khab::md5sum(soilmapdbf)) # load in memory
system.time(openssl::md5(file(soilmapdbf)))
#> user system elapsed
#> 2.885 0.292 3.180
system.time(openssl::md5(file(soilmapdbf)))
#> user system elapsed
#> 2.699 0.236 2.935
system.time(n2khab::md5sum(soilmapdbf))
#> user system elapsed
#> 2.745 0.240 2.986
system.time(tools::md5sum(soilmapdbf))
#> user system elapsed
#> 2.516 0.220 2.737
system.time(n2khab::md5sum(soilmapdbf))
#> user system elapsed
#> 2.668 0.252 2.921
system.time(tools::md5sum(soilmapdbf))
#> user system elapsed
#> 2.555 0.199 2.755
system.time(n2khab::md5sum(soilmapdbf))
#> user system elapsed
#> 2.651 0.315 2.967
system.time(tools::md5sum(soilmapdbf))
#> user system elapsed
#> 2.513 0.256 2.769
system.time(openssl::md5(file(soilmapdbf)))
#> user system elapsed
#> 2.693 0.312 3.005
system.time(openssl::md5(file(soilmapdbf)))
#> user system elapsed
#> 2.769 0.244 3.013
Created on 2021-03-19 by the reprex package (v1.0.0)
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.0.4 (2021-02-15)
#> os Linux Mint 20
#> system x86_64, linux-gnu
#> ui X11
#> language nl_BE:nl
#> collate nl_BE.UTF-8
#> ctype nl_BE.UTF-8
#> tz Europe/Brussels
#> date 2021-03-19
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date lib source
#> askpass 1.1 2019-01-13 [1] CRAN (R 4.0.2)
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
#> class 7.3-18 2021-01-24 [4] CRAN (R 4.0.3)
#> classInt 0.4-3 2020-04-07 [1] CRAN (R 4.0.2)
#> cli 2.3.0 2021-01-31 [1] CRAN (R 4.0.3)
#> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.3)
#> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.3)
#> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3)
#> dplyr 1.0.4 2021-02-02 [1] CRAN (R 4.0.3)
#> e1071 1.7-4 2020-10-14 [1] CRAN (R 4.0.3)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.2)
#> forcats 0.5.1 2021-01-27 [1] CRAN (R 4.0.3)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
#> generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.3)
#> git2r 0.28.0 2021-01-10 [1] CRAN (R 4.0.3)
#> git2rdata 0.3.1 2021-01-21 [1] CRAN (R 4.0.3)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.2)
#> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.3)
#> KernSmooth 2.23-18 2020-10-29 [4] CRAN (R 4.0.3)
#> knitr 1.31 2021-01-27 [1] CRAN (R 4.0.3)
#> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.4)
#> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3)
#> n2khab 0.4.0.9000 2021-03-19 [1] Github (inbo/n2khab@395e504)
#> openssl 1.4.3 2020-09-18 [1] CRAN (R 4.0.2)
#> pillar 1.4.7 2020-11-20 [1] CRAN (R 4.0.3)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2)
#> plyr 1.8.6 2020-03-03 [1] CRAN (R 4.0.2)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
#> R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.3)
#> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.3)
#> reprex 1.0.0 2021-01-27 [1] CRAN (R 4.0.3)
#> rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.3)
#> rmarkdown 2.6 2020-12-14 [1] CRAN (R 4.0.3)
#> rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.0.3)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.3)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
#> sf 0.9-8 2021-03-04 [1] Github (florisvdh/sf@34434d1)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
#> tibble 3.0.6 2021-01-29 [1] CRAN (R 4.0.3)
#> tidyr 1.1.2 2020-08-27 [1] CRAN (R 4.0.2)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2)
#> units 0.6-7 2020-06-13 [1] CRAN (R 4.0.2)
#> vctrs 0.3.6 2020-12-17 [1] CRAN (R 4.0.3)
#> withr 2.4.1 2021-01-26 [1] CRAN (R 4.0.3)
#> xfun 0.21 2021-02-10 [1] CRAN (R 4.0.4)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2)
#>
#> [1] /home/floris/lib/R/library
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
One way could be to allocate
If speed is a concern, you could also consider https://github.com/Cyan4973/xxHash, which is implemented via digest; see https://github.com/ropensci/targets/blob/main/R/utils_digest.R (I always check the targets package: its author often chooses the best-performing tools). EDIT: but that algorithm is not available via
Note that
Thanks @hansvancalster! I'll have another look at xxHash - I've encountered it when exploring this subject, but missed its R implementation. Indeed I tend to choose more established (more widely screened) ones such as md5 / sha256. For verifying larger amounts of files (e.g. the whole
Note that
Indeed; for that case, the speed difference is too minor IMO to decide among them. So I have no strong opinion. Maybe you are right that
OK, it appears that
See also #112 (comment), where some findings about xxHash have been added. Some timings:
soilmapdbf <- file.path(n2khab::fileman_up("n2khab_data"),
"10_raw/soilmap/soilmap.dbf")
invisible(n2khab::md5sum(soilmapdbf)) # load in memory
system.time(digest::digest(soilmapdbf, algo = "xxhash64", file = TRUE))
#> user system elapsed
#> 0.285 0.208 0.494
system.time(digest::digest(soilmapdbf, algo = "md5", file = TRUE))
#> user system elapsed
#> 3.539 0.280 3.819
system.time(n2khab::md5sum(soilmapdbf))
#> user system elapsed
#> 2.690 0.292 2.983
system.time(tools::md5sum(soilmapdbf))
#> user system elapsed
#> 2.521 0.264 2.786
system.time(digest::digest(soilmapdbf, algo = "xxhash64", file = TRUE))
#> user system elapsed
#> 0.296 0.200 0.496
system.time(digest::digest(soilmapdbf, algo = "md5", file = TRUE))
#> user system elapsed
#> 3.663 0.152 3.816
system.time(n2khab::md5sum(soilmapdbf))
#> user system elapsed
#> 2.715 0.272 2.987
system.time(tools::md5sum(soilmapdbf))
#> user system elapsed
#> 2.521 0.264 2.785
Created on 2021-03-22 by the reprex package (v1.0.0)
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.0.4 (2021-02-15)
#> os Linux Mint 20
#> system x86_64, linux-gnu
#> ui X11
#> language nl_BE:nl
#> collate nl_BE.UTF-8
#> ctype nl_BE.UTF-8
#> tz Europe/Brussels
#> date 2021-03-22
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date lib source
#> askpass 1.1 2019-01-13 [1] CRAN (R 4.0.2)
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
#> class 7.3-18 2021-01-24 [4] CRAN (R 4.0.3)
#> classInt 0.4-3 2020-04-07 [1] CRAN (R 4.0.2)
#> cli 2.3.0 2021-01-31 [1] CRAN (R 4.0.3)
#> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.3)
#> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.3)
#> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3)
#> dplyr 1.0.4 2021-02-02 [1] CRAN (R 4.0.3)
#> e1071 1.7-4 2020-10-14 [1] CRAN (R 4.0.3)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.2)
#> forcats 0.5.1 2021-01-27 [1] CRAN (R 4.0.3)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
#> generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.3)
#> git2r 0.28.0 2021-01-10 [1] CRAN (R 4.0.3)
#> git2rdata 0.3.1 2021-01-21 [1] CRAN (R 4.0.3)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.2)
#> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.3)
#> KernSmooth 2.23-18 2020-10-29 [4] CRAN (R 4.0.3)
#> knitr 1.31 2021-01-27 [1] CRAN (R 4.0.3)
#> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.4)
#> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3)
#> n2khab 0.4.0.9000 2021-03-19 [1] Github (inbo/n2khab@395e504)
#> openssl 1.4.3 2020-09-18 [1] CRAN (R 4.0.2)
#> pillar 1.4.7 2020-11-20 [1] CRAN (R 4.0.3)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2)
#> plyr 1.8.6 2020-03-03 [1] CRAN (R 4.0.2)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
#> R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.3)
#> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.3)
#> reprex 1.0.0 2021-01-27 [1] CRAN (R 4.0.3)
#> rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.3)
#> rmarkdown 2.6 2020-12-14 [1] CRAN (R 4.0.3)
#> rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.0.3)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.3)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
#> sf 0.9-8 2021-03-04 [1] Github (florisvdh/sf@34434d1)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
#> tibble 3.0.6 2021-01-29 [1] CRAN (R 4.0.3)
#> tidyr 1.1.2 2020-08-27 [1] CRAN (R 4.0.2)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2)
#> units 0.6-7 2020-06-13 [1] CRAN (R 4.0.2)
#> vctrs 0.3.6 2020-12-17 [1] CRAN (R 4.0.3)
#> withr 2.4.1 2021-01-26 [1] CRAN (R 4.0.3)
#> xfun 0.21 2021-02-10 [1] CRAN (R 4.0.4)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2)
#>
#> [1] /home/floris/lib/R/library
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
Sidenote: the md5 hashing by
I am quite attracted by the much higher speed of xxHash, so I might want to take the small risk of using XXH64 for now, over MD5 or SHA256. ('Risk': its hash is less unique, its R implementation is tailor-made (not linked), and the technology is not as well known.) Disk reading time still matters more, though. We can still continue to store the MD5/SHA256 as a backup, and perhaps expose the algorithm choice as a user setting.
To maintain future flexibility regarding the hash algorithm choice, I intend to rewrite this PR a bit so that a more generic hash function is provided, e.g. with xxh64, md5 and sha256 as options, and one set as default. When using this function throughout n2khab, we can code it such that it just needs a change in its default to let the whole package switch to another hash (by default).
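The generic function described above could be sketched roughly as follows. This is an illustration only, not the code in this PR: the name `checksum_sketch()` and its `hash` argument are hypothetical, and for brevity all three algorithms go through `digest::digest()` (assumed to be installed), whereas the thread suggests the PR relies on openssl for some of them.

```r
# Hypothetical sketch: one entry point, algorithm selected by an argument.
# Assumes the digest package is installed.
checksum_sketch <- function(files, hash = c("xxh64", "md5", "sha256")) {
  # The first element of the choices acts as the default; changing the
  # order here would switch the default for every caller at once.
  hash <- match.arg(hash)
  stopifnot(is.character(files), all(file.exists(files)))
  # Map the user-facing names onto digest's algorithm names
  algo <- switch(hash, xxh64 = "xxhash64", md5 = "md5", sha256 = "sha256")
  sums <- vapply(
    files,
    function(f) digest::digest(f, algo = algo, file = TRUE),
    character(1)
  )
  names(sums) <- basename(files)
  sums
}

# Usage on a temporary file
tmp <- tempfile()
writeLines("some text", tmp)
checksum_sketch(tmp)                # XXH64 by default
checksum_sketch(tmp, hash = "md5")  # explicit choice
```

Keeping the default in a single argument means a later switch of algorithm would not require touching the callers.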
Thanks for the useful addition! I cannot comment on the choice of the package since I am not experienced enough, but I tested the new functions. I tried with single and multiple files, and with different file types, and
As far as I understand
A side note: because of this difference in usage between
When I encounter an error (in this case on purpose), the next time I use the function I will get a warning (even though the second try works smoothly). Not really a problem, but it might be intriguing/disturbing for users.
Thank you for testing @cecileherr!
Yes, that's intended behaviour. File hashes are calculated for files only. I'll add a check to error on folders.
Well spotted 👍. I overlooked the case where the function would error because a folder is provided instead of a file. Although the hashing functions are primarily intended for internal usage, I exported them so it's better to harden them as well. Will try to avoid dangling connections (in case of an error) by re-adopting dropped parts of 1460b88.
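For illustration only: a common base R way to avoid a dangling connection when an error occurs is `on.exit()`. The helper name `with_connection()` is hypothetical; it is not part of n2khab or of the commits mentioned in this thread.

```r
# Hypothetical helper: open a file connection, run `fun` on it, and make
# sure the connection is closed whether `fun` succeeds or fails.
with_connection <- function(path, fun) {
  con <- file(path, open = "r")
  on.exit(close(con), add = TRUE)  # runs on normal return and on error
  fun(con)
}

tmp <- tempfile()
writeLines(c("first", "second"), tmp)

# Normal use: the connection is closed after reading
with_connection(tmp, function(con) readLines(con, n = 1))

# Failing use: the error is raised, but no connection is left open
try(with_connection(tmp, function(con) stop("boom")), silent = TRUE)
showConnections()
```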
For shapefiles, we could indeed replace the folder with the *.shp file as default - that should also work for
For shapefiles (and things like
So, in the end all functions should rather have a
Should be solved by 0cec424.
Update: I think this is not needed. Currently only actual files are accepted, by 0cec424. So the connection error should never happen, unless on rare occasions like I/O errors or something like that (but that would yield more error messages anyway). Do you agree @cecileherr?
In the future we may not use the openssl functions by default. Preparing for that.
General function
Currently defaults to the XXH64 hash function.
library(n2khab)
# creating two different temporary files:
file1 <- tempfile()
file2 <- tempfile()
files <- c(file1, file2)
file.create(files)
#> [1] TRUE TRUE
con <- file(file2)
writeLines("some text", con)
close(con)
# computing alternative checksums:
checksum(files)
#> filebd3f1b8e63c4 filebd3f116546f0
#> "ef46db3751d8e999" "b563efb2061ae502"
xxh64sum(files)
#> filebd3f1b8e63c4 filebd3f116546f0
#> "ef46db3751d8e999" "b563efb2061ae502"
md5sum(files)
#> filebd3f1b8e63c4 filebd3f116546f0
#> "d41d8cd98f00b204e9800998ecf8427e" "4d93d51945b88325c213640ef59fc50b"
sha256sum(files)
#> filebd3f1b8e63c4
#> "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
#> filebd3f116546f0
#> "a23e5fdcd7b276bdd81aa1a0b7b963101863dd3f61ff57935f8c5ba462681ea6"
Created on 2021-03-24 by the reprex package (v1.0.0)
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.0.4 (2021-02-15)
#> os Linux Mint 20
#> system x86_64, linux-gnu
#> ui X11
#> language nl_BE:nl
#> collate nl_BE.UTF-8
#> ctype nl_BE.UTF-8
#> tz Europe/Brussels
#> date 2021-03-24
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date lib source
#> askpass 1.1 2019-01-13 [1] CRAN (R 4.0.2)
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
#> class 7.3-18 2021-01-24 [4] CRAN (R 4.0.3)
#> classInt 0.4-3 2020-04-07 [1] CRAN (R 4.0.2)
#> cli 2.3.0 2021-01-31 [1] CRAN (R 4.0.3)
#> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.3)
#> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.3)
#> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3)
#> dplyr 1.0.4 2021-02-02 [1] CRAN (R 4.0.3)
#> e1071 1.7-4 2020-10-14 [1] CRAN (R 4.0.3)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.2)
#> forcats 0.5.1 2021-01-27 [1] CRAN (R 4.0.3)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
#> generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.3)
#> git2r 0.28.0 2021-01-10 [1] CRAN (R 4.0.3)
#> git2rdata 0.3.1 2021-01-21 [1] CRAN (R 4.0.3)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.2)
#> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.3)
#> KernSmooth 2.23-18 2020-10-29 [4] CRAN (R 4.0.3)
#> knitr 1.31 2021-01-27 [1] CRAN (R 4.0.3)
#> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.4)
#> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3)
#> n2khab * 0.4.0.9000 2021-03-24 [1] local
#> openssl 1.4.3 2020-09-18 [1] CRAN (R 4.0.2)
#> pillar 1.4.7 2020-11-20 [1] CRAN (R 4.0.3)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2)
#> plyr 1.8.6 2020-03-03 [1] CRAN (R 4.0.2)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
#> R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.3)
#> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.3)
#> reprex 1.0.0 2021-01-27 [1] CRAN (R 4.0.3)
#> rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.3)
#> rmarkdown 2.6 2020-12-14 [1] CRAN (R 4.0.3)
#> rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.0.3)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.3)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
#> sf 0.9-8 2021-03-04 [1] Github (florisvdh/sf@34434d1)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
#> tibble 3.0.6 2021-01-29 [1] CRAN (R 4.0.3)
#> tidyr 1.1.2 2020-08-27 [1] CRAN (R 4.0.2)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2)
#> units 0.6-7 2020-06-13 [1] CRAN (R 4.0.2)
#> vctrs 0.3.6 2020-12-17 [1] CRAN (R 4.0.3)
#> withr 2.4.1 2021-01-26 [1] CRAN (R 4.0.3)
#> xfun 0.21 2021-02-10 [1] CRAN (R 4.0.4)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2)
#>
#> [1] /home/floris/lib/R/library
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
@cecileherr here are the XXH64 hashes that you'll probably need (MD5 provided as check against reference list / Zenodo):
library(n2khab)
library(magrittr)
hms_2018 <- file.path(n2khab::fileman_up("n2khab_data"),
"20_processed/habitatmap_stdized_2018_v2/habitatmap_stdized.gpkg")
hms_2020 <- file.path(n2khab::fileman_up("n2khab_data"),
"20_processed/habitatmap_stdized/habitatmap_stdized.gpkg")
files <- c(hms_2018, hms_2020)
md5sum(files) %>% set_names(c(2018, 2020))
#> 2018 2020
#> "7f89d4a6bc0b2080a0510169497259ff" "e2f0bff28016bd525ab652349c555141"
xxh64sum(files) %>% set_names(c(2018, 2020))
#> 2018 2020
#> "b80f469f33636c8b" "3109c26f0a27a0f3"
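Checking a file against a reference hash, as suggested above for the MD5 values, might look like the following sketch. It uses only `tools::md5sum()` from base R; the helper name `verify_md5()` is hypothetical, and the demonstration uses an empty temporary file rather than the actual data files.

```r
# Hypothetical helper: compare a file's MD5 with an expected reference value.
verify_md5 <- function(path, expected) {
  stopifnot(file.exists(path), !dir.exists(path))
  actual <- unname(tools::md5sum(path))
  identical(tolower(expected), actual)
}

# Demonstration on an empty temporary file; the reference value is the
# MD5 of empty input, as also shown for the empty file earlier in the thread.
tmp <- tempfile()
file.create(tmp)
verify_md5(tmp, "d41d8cd98f00b204e9800998ecf8427e")
```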
Thanks for the update! 👍 (and for the 5* service: you even provide the md5/xxh64 I need 👌 )
I tested this new version and as far as I can see, it seems to work as expected. Both the openssl options and the digest option run smoothly and the few details I had mentioned in my previous review are solved.
    msg = paste0("Only files are accepted; ",
                 "the following path(s) are directories:\n",
                 paste0(x[isdir], collapse = "\n")))
}
Nice! Clear error messages
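A stand-alone version of such a check could be sketched as below, using base R only. The function name `assert_files_only()` is hypothetical, and the reviewed code is only partly visible in the fragment above, so this is not a copy of it.

```r
# Hypothetical stand-alone check: error out if any path is a directory.
assert_files_only <- function(x) {
  stopifnot(is.character(x))
  isdir <- dir.exists(x)
  if (any(isdir)) {
    stop(
      "Only files are accepted; ",
      "the following path(s) are directories:\n",
      paste0(x[isdir], collapse = "\n"),
      call. = FALSE
    )
  }
  invisible(x)
}

# A regular file passes silently; a directory triggers the error
tmp <- tempfile()
file.create(tmp)
assert_files_only(tmp)
try(assert_files_only(c(tmp, tempdir())))
```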
Tested: this version works as expected
Thanks for your review @cecileherr.
@cecileherr can you test these, e.g. by running the example or for a Zenodo file?
Adding these functions was more of an adventure than originally expected. After writing, I discovered there is also tools::md5sum() 🤔. However, in general I believe it's best to lean on OpenSSL as a standard; tools::md5sum() seems to be hardcoded in R and, from its docs, may not always work the same on all platforms. I also struggled with silently closing connections until I found that the function itself was not the cause, but the example I ran... At least I think so now (please look for warnings).
sha256sum() can be used to generate and store sha256sums as file metadata. I think this implementation is not available in core R packages.