Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Add a synthetic data set for speedy testing & demonstration #114

Closed
bryanhanson opened this issue May 12, 2020 · 13 comments
Closed
Labels
Type: proposal 💡 Proposed ideas for all to consider.

Comments

@bryanhanson
Copy link
Collaborator

bryanhanson commented May 12, 2020

Many packages include a small, lean data set for demonstrating functions, plotting, testing etc. hyperSpec has only real-world data sets that are reasonably large, in some cases very large. The existing large data sets bring with them considerable infrastructure (e.g. in Makefile) and slow down building and checking.

I propose to include a synthetic data set in the package. I already have a couple handy and nearly ready, from other projects.

If this idea is deemed worthy, I would create a branch called "SynData" for the work.

A related suggestion would be to disable/remove the existing data sets and update the Makefile accordingly for the purposes of this branch. Any examples using the existing data would have to be disabled or converted to use the synthetic data set. This would make things simpler and faster. If we also plan to put data in it's own package, this might be the way to go (completely remove the data now and add it later to another package, which addresses the unique storage needs of larger files).

Please respond about the value of 1) adding a synthetic data set and 2) removing the existing data sets and infrastructure for the time-being.

If approved, I will take responsibility for creating and adding the synthetic data set. If we want to disable the existing data sets I will ask Erick to do certain parts of that.

@bryanhanson bryanhanson added the Type: proposal 💡 Proposed ideas for all to consider. label May 12, 2020
@bryanhanson
Copy link
Collaborator Author

@eoduniyi
Copy link
Collaborator

@ximeg
Copy link
Collaborator

ximeg commented May 12, 2020

Hi @bryanhanson, I support your proposal. Please name your branch feature/114-syndata and start if from the develop branch to keep it in agreement with our new guidelines (#112 and #113 ) 🙂

I agree that the real data should go into a separate package. It should be installed only if needed. Lean Synthetic dataset is also a great idea.

If we move data to a separate package (hyperSpec-datasets?), it would still be quite big. May be we could keep the data available online (e.g. on GitHub) outside of the package as a bunch of files. The hyperSpec-datasets package would download the data during the installation and create RData files on the end user computer. This way the hyperSpec-datasets would be lean as well.

@cbeleites
Copy link
Owner

I like this.

As for the dataset package, we may talk to the CRAN people whether we are allowed to submit a large package there since it will be updated very rarely (if ever).

@bryanhanson
Copy link
Collaborator Author

bryanhanson commented May 13, 2020 via email

@ximeg
Copy link
Collaborator

ximeg commented May 13, 2020

I am looking at the code (also of hyperSpec.tidyverse) and see unit tests that make use of the big datasets packaged with hyperSpec, like this one:

  test_that("filtering extra data columns: numeric", {
    skip("until flu is exported from hyperSpec")
    expect_equal(filter(flu, c > 0.3), flu [flu$c > 0.3]) # 0 row object

    # filter drops row names, so only equivalent, not equal:
    expect_equivalent(filter(flu, c > 0.2), flu [flu$c > 0.2])

    expect_equivalent(
      filter(chondro, clusters == "lacuna"),
      chondro [chondro$clusters == "lacuna" & !is.na(chondro$clusters)]
    )
  })

I am afraid that removal of these dataset would require update of multiple unit tests. Or we have to find some workaround. Similarly, vignettes require these datasets to be built. Any ideas?

@bryanhanson
Copy link
Collaborator Author

bryanhanson commented May 14, 2020 via email

@cbeleites
Copy link
Owner

Wrt. the vignettes we have 2 options:

  • moving to something like pkgdown (see Feature request: online documentation with pkgdown or similar #78) which allows us to publish vignettes that are available online but are not packaged and shipped with hyperSpec (pkgdown calls this articles)

  • moving to synthetic data

  • For the unit tests, I believe synthetic data would be good. We already have a number of unit tests where I do "synthetic" changes to the example data sets (e.g. putting some NAs into flu to have a hyperSpec object with missing values).
    So synthetic data sets that are expressly good for unit testing would anyways be an enhancement.

  • The existing data sets are extensively used throughout not only the unit tests but also the examples.
    However, the examples do need overhauling, particularly those of very early functions. When I started developing hyperSpec, there was not yet a convenient unit test package in R, and I went with the then common strategy of abusing examples as unit tests.

@bryanhanson
Copy link
Collaborator Author

Another option is to have static pdf vignettes using the R.rsp package.

@ximeg
Copy link
Collaborator

ximeg commented May 15, 2020

I’ll put the synthetic data set together tomorrow. As far as removing the existing data sets temporarily, yes, it would be a lot of work. I wouldn’t do it w/o careful consideration. So I agree with you Roman. But maybe we will need to pare back strongly and re-assemble.

Just to make it easier to find on GitHub, I copied here the description of the dataset that Bryan put together in the file hyperSpec/R/FauxCell.R:

Faux Cell is a synthetic data set intended for testing and demonstration. It
is small so that it processes quickly. There are 300 Raman-like spectra
allocated to three groups/regions. The spectrum of each region is unique and
simple, with a single peak with a particular frequency and line width. Each
spectrum has 1400 data points. A small amount of noise has been added. The
three regions represent the sample matrix, a cell, and a nucleus within the
cell. The data is indexed along x and y dimensions, simulating data collected
on a grid.

@bryanhanson bryanhanson mentioned this issue May 17, 2020
11 tasks
@cbeleites
Copy link
Owner

I think it should be possible to make the generating script/function self-contained.
We can then have the data set generated on the fly, without the need for a big data file.

@bryanhanson
Copy link
Collaborator Author

Completed via a series of commits ending with 94bf154 Closing.

@cbeleites
Copy link
Owner

cbeleites commented Jun 8, 2020

I tried to make fauxCell available as proper data set - but give up for now.

Here is what did not work:

  • adding a .R script calling .fauxCell(): apparently the data is executed first, before the rest of the package is processed. So class hyperSpec and its functions are not available at that stage. (The Writing R Extensions manual says so as well.)
  • neiter did adding a datalist file that lists fauxCell work (probably for the same reason).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: proposal 💡 Proposed ideas for all to consider.
Projects
None yet
Development

No branches or pull requests

4 participants