Proposal: Add a synthetic data set for speedy testing & demonstration #114

bryanhanson · 2020-05-12T01:54:50Z

Many packages include a small, lean data set for demonstrating functions, plotting, testing etc. hyperSpec has only real-world data sets that are reasonably large, in some cases very large. The existing large data sets bring with them considerable infrastructure (e.g. in Makefile) and slow down building and checking.

I propose to include a synthetic data set in the package. I already have a couple handy and nearly ready, from other projects.

If this idea is deemed worthy, I would create a branch called "SynData" for the work.

A related suggestion would be to disable/remove the existing data sets and update the Makefile accordingly for the purposes of this branch. Any examples using the existing data would have to be disabled or converted to use the synthetic data set. This would make things simpler and faster. If we also plan to put data in it's own package, this might be the way to go (completely remove the data now and add it later to another package, which addresses the unique storage needs of larger files).

Please respond about the value of 1) adding a synthetic data set and 2) removing the existing data sets and infrastructure for the time-being.

If approved, I will take responsibility for creating and adding the synthetic data set. If we want to disable the existing data sets I will ask Erick to do certain parts of that.

The text was updated successfully, but these errors were encountered:

bryanhanson · 2020-05-12T01:56:15Z

Consider adding synthetic data set

eoduniyi · 2020-05-12T01:58:06Z

ximeg · 2020-05-12T02:01:56Z

Hi @bryanhanson, I support your proposal. Please name your branch feature/114-syndata and start if from the develop branch to keep it in agreement with our new guidelines (#112 and #113 ) 🙂

I agree that the real data should go into a separate package. It should be installed only if needed. Lean Synthetic dataset is also a great idea.

If we move data to a separate package (hyperSpec-datasets?), it would still be quite big. May be we could keep the data available online (e.g. on GitHub) outside of the package as a bunch of files. The hyperSpec-datasets package would download the data during the installation and create RData files on the end user computer. This way the hyperSpec-datasets would be lean as well.

cbeleites · 2020-05-13T19:12:37Z

I like this.

As for the dataset package, we may talk to the CRAN people whether we are allowed to submit a large package there since it will be updated very rarely (if ever).

bryanhanson · 2020-05-13T20:08:04Z

I also have some saved discussion on this from R-devel list; I will check and report back.

…

On May 13, 2020, at 3:12 PM, Claudia Beleites ***@***.***> wrote: I like this. As for the dataset package, we may talk to the CRAN people whether we are allowed to submit a large package there since it will be updated very rarely (if ever). — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#114 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AABCIPSKFQ2BFXOIMYSQWMTRRLWLHANCNFSM4M6NAQQA>.

ximeg · 2020-05-13T23:08:39Z

I am looking at the code (also of hyperSpec.tidyverse) and see unit tests that make use of the big datasets packaged with hyperSpec, like this one:

  test_that("filtering extra data columns: numeric", {
    skip("until flu is exported from hyperSpec")
    expect_equal(filter(flu, c > 0.3), flu [flu$c > 0.3]) # 0 row object

    # filter drops row names, so only equivalent, not equal:
    expect_equivalent(filter(flu, c > 0.2), flu [flu$c > 0.2])

    expect_equivalent(
      filter(chondro, clusters == "lacuna"),
      chondro [chondro$clusters == "lacuna" & !is.na(chondro$clusters)]
    )
  })

I am afraid that removal of these dataset would require update of multiple unit tests. Or we have to find some workaround. Similarly, vignettes require these datasets to be built. Any ideas?

bryanhanson · 2020-05-14T01:39:41Z

I’ll put the synthetic data set together tomorrow. As far as removing the existing data sets temporarily, yes, it would be a lot of work. I wouldn’t do it w/o careful consideration. So I agree with you Roman. But maybe we will need to pare back strongly and re-assemble.

…

On May 13, 2020, at 7:08 PM, RKiselev ***@***.***> wrote: I am looking at the code (also of hyperSpec.tidyverse) and see unit tests that make use of the big datasets packaged with hyperSpec, like this one: test_that("filtering extra data columns: numeric", { skip("until flu is exported from hyperSpec") expect_equal(filter(flu, c > 0.3), flu [flu$c > 0.3]) # 0 row object # filter drops row names, so only equivalent, not equal: expect_equivalent(filter(flu, c > 0.2), flu [flu$c > 0.2]) expect_equivalent( filter(chondro, clusters == "lacuna"), chondro [chondro$clusters == "lacuna" & !is.na(chondro$clusters)] ) }) I am afraid that removal of these dataset would require update of multiple unit tests. Or we have to find some workaround. Similarly, vignettes require these datasets to be built. Any ideas? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#114 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AABCIPQSEYLIMIINGNNNYPDRRMSAHANCNFSM4M6NAQQA>.

cbeleites · 2020-05-15T12:55:17Z

Wrt. the vignettes we have 2 options:

moving to something like pkgdown (see Feature request: online documentation with pkgdown or similar #78) which allows us to publish vignettes that are available online but are not packaged and shipped with hyperSpec (pkgdown calls this articles)
moving to synthetic data
For the unit tests, I believe synthetic data would be good. We already have a number of unit tests where I do "synthetic" changes to the example data sets (e.g. putting some NAs into flu to have a hyperSpec object with missing values).
So synthetic data sets that are expressly good for unit testing would anyways be an enhancement.
The existing data sets are extensively used throughout not only the unit tests but also the examples.
However, the examples do need overhauling, particularly those of very early functions. When I started developing hyperSpec, there was not yet a convenient unit test package in R, and I went with the then common strategy of abusing examples as unit tests.

bryanhanson · 2020-05-15T13:48:23Z

Another option is to have static pdf vignettes using the R.rsp package.

ximeg · 2020-05-15T18:46:02Z

I’ll put the synthetic data set together tomorrow. As far as removing the existing data sets temporarily, yes, it would be a lot of work. I wouldn’t do it w/o careful consideration. So I agree with you Roman. But maybe we will need to pare back strongly and re-assemble.

Just to make it easier to find on GitHub, I copied here the description of the dataset that Bryan put together in the file hyperSpec/R/FauxCell.R:

Faux Cell is a synthetic data set intended for testing and demonstration. It
is small so that it processes quickly. There are 300 Raman-like spectra
allocated to three groups/regions. The spectrum of each region is unique and
simple, with a single peak with a particular frequency and line width. Each
spectrum has 1400 data points. A small amount of noise has been added. The
three regions represent the sample matrix, a cell, and a nucleus within the
cell. The data is indexed along x and y dimensions, simulating data collected
on a grid.

cbeleites · 2020-05-18T15:52:29Z

I think it should be possible to make the generating script/function self-contained.
We can then have the data set generated on the fly, without the need for a big data file.

bryanhanson · 2020-05-27T13:35:40Z

Completed via a series of commits ending with 94bf154 Closing.

cbeleites · 2020-06-08T19:18:19Z

I tried to make fauxCell available as proper data set - but give up for now.

Here is what did not work:

adding a .R script calling .fauxCell(): apparently the data is executed first, before the rest of the package is processed. So class hyperSpec and its functions are not available at that stage. (The Writing R Extensions manual says so as well.)
neiter did adding a datalist file that lists fauxCell work (probably for the same reason).

bryanhanson added the Type: proposal 💡 Proposed ideas for all to consider. label May 12, 2020

bryanhanson mentioned this issue May 17, 2020

Hard Prune #128

Closed

11 tasks

cbeleites mentioned this issue May 18, 2020

How to handle large data sets #129

Open

bryanhanson closed this as completed May 27, 2020

cbeleites mentioned this issue Jun 11, 2020

Convert small data sets to "static" versions #137

Closed

11 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: Add a synthetic data set for speedy testing & demonstration #114

Proposal: Add a synthetic data set for speedy testing & demonstration #114

bryanhanson commented May 12, 2020 •

edited

Loading

bryanhanson commented May 12, 2020

eoduniyi commented May 12, 2020

ximeg commented May 12, 2020 •

edited

Loading

cbeleites commented May 13, 2020

bryanhanson commented May 13, 2020 via email

ximeg commented May 13, 2020

bryanhanson commented May 14, 2020 via email

cbeleites commented May 15, 2020

bryanhanson commented May 15, 2020

ximeg commented May 15, 2020

cbeleites commented May 18, 2020

bryanhanson commented May 27, 2020

cbeleites commented Jun 8, 2020 •

edited by ximeg

Loading

Proposal: Add a synthetic data set for speedy testing & demonstration #114

Proposal: Add a synthetic data set for speedy testing & demonstration #114

Comments

bryanhanson commented May 12, 2020 • edited Loading

bryanhanson commented May 12, 2020

eoduniyi commented May 12, 2020

ximeg commented May 12, 2020 • edited Loading

cbeleites commented May 13, 2020

bryanhanson commented May 13, 2020 via email

ximeg commented May 13, 2020

bryanhanson commented May 14, 2020 via email

cbeleites commented May 15, 2020

bryanhanson commented May 15, 2020

ximeg commented May 15, 2020

cbeleites commented May 18, 2020

bryanhanson commented May 27, 2020

cbeleites commented Jun 8, 2020 • edited by ximeg Loading

bryanhanson commented May 12, 2020 •

edited

Loading

ximeg commented May 12, 2020 •

edited

Loading

cbeleites commented Jun 8, 2020 •

edited by ximeg

Loading