Proposal: Add a synthetic data set for speedy testing & demonstration #114
Comments
Hi @bryanhanson, I support your proposal. Please name your branch as proposed. I agree that the real data should go into a separate package; it should be installed only if needed. A lean synthetic dataset is also a great idea. If we move data to a separate package (…)
I like this. As for the dataset package, we may talk to the CRAN people about whether we are allowed to submit a large package there, since it will be updated very rarely (if ever).
I also have some saved discussion on this from R-devel list; I will check and report back.
I am looking at the code (also of hyperSpec.tidyverse) and see unit tests that make use of the big datasets packaged with hyperSpec, like this one:

```r
test_that("filtering extra data columns: numeric", {
  skip("until flu is exported from hyperSpec")
  expect_equal(filter(flu, c > 0.3), flu[flu$c > 0.3]) # 0-row object
  # filter drops row names, so only equivalent, not equal:
  expect_equivalent(filter(flu, c > 0.2), flu[flu$c > 0.2])
  expect_equivalent(
    filter(chondro, clusters == "lacuna"),
    chondro[chondro$clusters == "lacuna" & !is.na(chondro$clusters)]
  )
})
```

I am afraid that removal of these datasets would require updating multiple unit tests, or we would have to find some workaround. Similarly, the vignettes require these datasets to be built. Any ideas?
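One possible workaround (a sketch only, assuming testthat; the helper name and the `hyperSpecData` package name are illustrative, not decided) is to guard dataset-dependent tests with a skip, so the test suite still passes while the big datasets live elsewhere:

```r
library(testthat)

# Hypothetical guard: skip a test when the package that would provide
# the big datasets is not installed. "hyperSpecData" is a placeholder name.
skip_if_no_dataset <- function(name, package = "hyperSpecData") {
  if (!requireNamespace(package, quietly = TRUE)) {
    skip(paste0("package '", package, "' (providing '", name, "') not installed"))
  }
}

test_that("filtering extra data columns: numeric", {
  skip_if_no_dataset("flu")
  # ... original expectations using flu / chondro go here ...
})
```

This keeps the tests in place and self-documenting, rather than deleting them, at the cost of reduced coverage on machines without the data package.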
I’ll put the synthetic data set together tomorrow. As far as removing the existing data sets temporarily, yes, it would be a lot of work. I wouldn’t do it w/o careful consideration. So I agree with you Roman. But maybe we will need to pare back strongly and re-assemble.
Regarding the vignettes, we have two options: (…)
Another option is to have static PDF vignettes using the R.rsp package.
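For reference, the static-vignette route via R.rsp needs roughly the following (a sketch; the vignette file name is illustrative). The pre-built PDF is shipped alongside a small `*.pdf.asis` stub that the `R.rsp::asis` engine picks up:

```
# In DESCRIPTION:
Suggests: R.rsp
VignetteBuilder: R.rsp

# In vignettes/: ship the pre-built PDF plus a matching *.pdf.asis stub,
# e.g. vignettes/hyperSpec-intro.pdf.asis containing:
%\VignetteIndexEntry{Introduction to hyperSpec}
%\VignetteEngine{R.rsp::asis}
```

The trade-off is that the vignette is no longer rebuilt during `R CMD check`, so its code is not exercised automatically.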
Just to make it easier to find on GitHub, I copied here the description of the dataset that Bryan put together in the file.
I think it should be possible to make the generating script/function self-contained.
Completed via a series of commits ending with 94bf154. Closing.
I tried to make … work. Here is what did not work: (…)
Many packages include a small, lean data set for demonstrating functions, plotting, testing, etc. hyperSpec has only real-world data sets that are reasonably large, in some cases very large. The existing large data sets bring with them considerable infrastructure (e.g. in the Makefile) and slow down building and checking.

I propose to include a synthetic data set in the package. I already have a couple handy and nearly ready, from other projects. If this idea is deemed worthy, I would create a branch called "SynData" for the work.

A related suggestion would be to disable/remove the existing data sets and update the Makefile accordingly for the purposes of this branch. Any examples using the existing data would have to be disabled or converted to use the synthetic data set. This would make things simpler and faster. If we also plan to put the data in its own package, this might be the way to go (completely remove the data now and add it later to another package, which addresses the unique storage needs of larger files).

Please respond about the value of 1) adding a synthetic data set and 2) removing the existing data sets and infrastructure for the time being.

If approved, I will take responsibility for creating and adding the synthetic data set. If we want to disable the existing data sets, I will ask Erick to do certain parts of that.
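To make the idea concrete, a self-contained generator for a tiny synthetic data set could look roughly like this (a sketch only, not Bryan's actual script; the function name, parameters, and the extra-data column are illustrative, chosen to mimic the shape of `flu`):

```r
library(hyperSpec)

# Hypothetical generator: n noisy Gaussian-band spectra on a small
# wavelength axis, with one extra data column `c` (like flu$c).
make_syn_spectra <- function(n = 10, wl = seq(400, 700, by = 2), seed = 1) {
  set.seed(seed)                      # reproducible synthetic data
  centers <- rnorm(n, mean = 550, sd = 10)
  heights <- runif(n, 0.5, 1.5)
  spc <- t(sapply(seq_len(n), function(i) {
    heights[i] * exp(-((wl - centers[i])^2) / (2 * 20^2)) +
      rnorm(length(wl), sd = 0.01)    # small additive noise
  }))
  new("hyperSpec",
      spc        = spc,
      wavelength = wl,
      data       = data.frame(c = heights))
}

syn <- make_syn_spectra()
```

Because everything is generated from a seed, the object could even be created at build time instead of being stored, keeping the package lean.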