Convert small data sets to "static" versions #137

Closed
11 tasks done
cbeleites opened this issue May 21, 2020 · 11 comments

@cbeleites
Owner

cbeleites commented May 21, 2020

Currently, hyperSpec generates its example data sets from real spectra files.
This is done during the building of the vignettes and comes at the cost of a non-standard building process with complex Makefiles (#132).

As part of the general cleanup:

  • make "static" snapshots of the data sets the primary source, i.e. place them under version control.
    • barbiturates
    • flu
    • laser
    • paracetamol
  • if possible, prune the data to be smaller (barbiturates, laser?)
  • clarify that the example data may not be exactly the same as the data in the corresponding vignettes where the original spectra files are imported.
  • clarify what kind of data they exemplify
  • put code to generate the data and possibly the raw/binary source sets into data-raw/. List data-raw/ in .Rbuildignore.
    see: https://r-pkgs.org/data.html#data-data
  • delete the parts of the Makefiles that create and copy these files into the package directory tree
  • delete the .gitignore entries referring to them.
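
The data-raw/ step in the list above could look roughly like the following. This is a minimal sketch, not hyperSpec's actual script: the object name `example_spectra`, its contents, and the output location are hypothetical stand-ins, and only base R is used. In a real package the script would live in data-raw/ and write to data/.

```r
# Hypothetical data-raw/ script sketch. All names below are illustrative
# assumptions, not hyperSpec's actual objects or file layout.

# Stand-in for a data set that would normally be imported from raw spectra files.
example_spectra <- data.frame(
  wavelength = seq(400, 700, by = 50),
  intensity  = c(0.10, 0.35, 0.80, 0.55, 0.20, 0.05, 0.01)
)

# In a real package the target would be "data/example_spectra.rda";
# a temporary directory keeps this sketch free of side effects.
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
out_file <- file.path(out_dir, "example_spectra.rda")

# Save the static snapshot as a standard .rda file with xz compression.
save(example_spectra, file = out_file, compress = "xz")

# Reload into a separate environment to check that the snapshot round-trips.
check_env <- new.env()
load(out_file, envir = check_env)
```

Keeping data-raw/ out of the built package then takes a single `.Rbuildignore` line such as `^data-raw$`.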
@bryanhanson
Collaborator

On the third point, why not use the static data in the vignettes? That would simplify things a great deal.

@cbeleites
Owner Author

@bryanhanson: thanks for spotting that. I was thinking of the vignettes that import the real spectra.

(Future) vignettes that are shipped with hyperSpec can and should use the data sets as in hyperSpec.

related: #138

@GegznaV
Collaborator

GegznaV commented May 21, 2020

Why is chondro not on this list? (I found the answer)

The description of the datasets may contain a sentence such as "The dataset provided with the package is a subset of... (some real data)". To illustrate the ideas/functionality, we can even use artificial spectra. Just a thought, but maybe some functions that simulate spectra could be included?

@GegznaV
Collaborator

GegznaV commented May 21, 2020

> (Future) vignettes that are shipped with hyperSpec can and should use the data sets as in hyperSpec.

I agree. But the links to the original dataset could also be provided.

@cbeleites changed the title from 'Convert data sets to "static" versions' to 'Convert small data sets to "static" versions' on May 21, 2020
@cbeleites
Owner Author

chondro is not on this list because it is too big and needs to be handled differently (#129). The solution for chondro is:

  • For the purpose of examples in the help pages, @bryanhanson's new FauxCell will take the role of chondro to illustrate a spectral map/image.
  • @ximeg is preparing a separate data package that contains the original data of chondro. The vignette (naturally) moves into the data package. The vignette already depends on the original data: the PCA-compressed version shows substantial compression artifacts that mess up the workflow. But even the best lossless compression of the original data that we can easily use (xz) still yields several MB, so there is no way to ship this other than in a data package.

@cbeleites
Owner Author

> links to the original dataset could also be provided.

There is a @source Roxygen tag that can be used. I was thinking of putting in something along these lines, e.g. for barbiturates: "This data set was prepared from the first few subfiles of 'BARBITURATES.SPC'. For more information and the original file, see package 'hyperSpec_import_spc'."
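
As a rough illustration of how such a @source entry might sit in a data set's documentation, here is a hedged sketch using the common roxygen2 pattern of documenting a data object by its quoted name. The title and description wording are invented for illustration, and the package name is simply the one quoted above, which had not been confirmed at this point.

```r
#' Barbiturates example data (hypothetical documentation sketch)
#'
#' A small static snapshot intended for examples. It may differ from the
#' data imported from the original spectra files in the vignettes.
#'
#' @source This data set was prepared from the first few subfiles of
#'   "BARBITURATES.SPC". For more information and the original file,
#'   see package 'hyperSpec_import_spc' (name not yet confirmed).
#' @keywords datasets
"barbiturates"
```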

@GegznaV
Collaborator

GegznaV commented May 21, 2020

> see package 'hyperSpec_import_spc'

Are the names of the new packages confirmed? Was there a discussion on that? I would prefer shorter names such as hy.import, hy.manager, hy.plot, hyperImport, hyperWrangler, hyperPlot, etc.

@GegznaV GegznaV added the Topic: datasets 📅 Related to datasets in hyperSpec label May 21, 2020
@cbeleites
Owner Author

No, that discussion was postponed to the next video call (Monday 7 pm EEST).

I opened an issue, though: #140

@GegznaV
Collaborator

GegznaV commented Jun 11, 2020

I'm experimenting with this topic. There are some (in my opinion) unnecessary entries in .Rbuildignore and/or .gitignore, which may have caused Claudia's experiments with the data to fail (she talked about that this Monday).

@cbeleites
Owner Author

cbeleites commented Jun 11, 2020

I had not been experimenting with the data sets in this issue but with fauxCell. The small data sets here are unproblematic: they are already standard .rda files and behave as expected. fauxCell is totally different since it is a variable created by the package source code. An explanation of what I tried is at the end of #114. I don't think it is related to .gitignore (or .Rbuildignore).

You are right that .gitignore will need to be changed: at the moment it reflects which files are created or copied into place in the package directory tree by make.
I added cleaning up .gitignore and the Makefiles to the TODO list at the top of the issue.

@cbeleites
Owner Author

Fixed the documentation. This closes the issue.
