Base Attributes: Software Deps & Machine #170

Merged
merged 2 commits into openPMD:upcoming-1.1.0 from topic-dependenciesMachine
Jan 30, 2018

Conversation

ax3l
Member

@ax3l ax3l commented Jan 23, 2018

Define new base attributes for software dependencies and involved machine.

  • Review: we need to decide whether to make the new attributes recommended (warn if missing) or optional. I would favor recommended, since this information is quite important for reproducible science and should be automated. -> optional

(Required would also be possible, but in 2.0 at the earliest, since it would break existing files and might be too strict compared to other base attributes such as software. We can also add it as optional now and upgrade it to recommended in a later major version in case general workflows develop around it in the community.)

Implements issues: #116 #137

Description

The individual software alone is not sufficient for proper documentation and reproducible data creation. We therefore reserve new attributes for both dependencies of the software and involved machinery.

For many data files, reproducing how they were created is increasingly complicated. Besides the need to share the input, the environment with which and on which a piece of software was built can have a dramatic influence on the outcome, e.g. due to changes or later-discovered bugs in dependencies such as writer libraries or linear algebra libraries.

Examples

For Python software, the semicolon-joined output of pip freeze or conda list --export would be a good start for the attribute softwareDependencies.
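A minimal sketch of how such a string could be assembled from within Python, assuming the standard-library importlib.metadata module (Python 3.8+) is an acceptable stand-in for calling pip freeze. The function name installed_distributions_string is hypothetical, and the name@version form follows the examples used elsewhere in this thread rather than any enforced format.

```python
from importlib import metadata


def installed_distributions_string():
    """Join all installed distributions as 'name@version' entries with ';'.

    Hypothetical helper: it lists everything installed in the current
    environment (similar in spirit to `pip freeze`), not only what the
    running program actually imports.
    """
    entries = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if not name:
            continue  # skip broken installs without a usable name
        # The version part is optional, so fall back to the bare name.
        entries.add(f"{name}@{version}" if version else name)
    return ";".join(sorted(entries, key=str.lower))
```

Calling print(installed_distributions_string()) shows the resulting string; like pip freeze, it can be long, since it covers every installed package.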

On HPC systems, output of module list would be a good starting point.
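As a rough sketch of the HPC case: instead of parsing the human-readable output of module list, one could read the LOADEDMODULES environment variable, which Environment Modules and Lmod maintain as a colon-separated list of name/version entries. The function name modules_to_dependencies and the conversion to name@version are assumptions made for illustration.

```python
import os


def modules_to_dependencies(loaded=None):
    """Convert a colon-separated module list into a ';'-joined string.

    `loaded` defaults to the LOADEDMODULES environment variable.
    Entries such as 'gcc/5.4.0' become 'gcc@5.4.0'; entries without a
    version are passed through unchanged.
    """
    if loaded is None:
        loaded = os.environ.get("LOADEDMODULES", "")
    entries = []
    for module in loaded.split(":"):
        module = module.strip()
        if not module:
            continue
        # Split at the last slash so hierarchical names keep their prefix.
        name, sep, version = module.rpartition("/")
        entries.append(f"{name}@{version}" if sep else module)
    return ";".join(entries)
```

For example, modules_to_dependencies("boost/1.62.0:gcc/5.4.0") yields boost@1.62.0;gcc@5.4.0.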

For CMake based builds, the versions of software in a target's INTERFACE_LINK_LIBRARIES property could be used to auto-generate a list.

For machine, the plain hostname, the name of a scientific instrument (camera type, etc.) or the cluster name is a good value. For hardware-centric projects, a list of relevant hardware and its versions could also be used.
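For the machine attribute, a small sketch of a possible default: the hostname as reported by the standard library, with an optional override. The function name machine_name and its override parameter are hypothetical; they only illustrate that a cluster or instrument name may describe the machine better than a node hostname.

```python
import socket


def machine_name(override=None):
    """Return a value for the `machine` attribute.

    `override` is a hypothetical parameter for cases where a cluster or
    instrument name is more meaningful than the local hostname.
    """
    if override:
        return override
    return socket.gethostname()
```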

Affected Components

  • base

Logic Changes

None.

Writer Changes

Writers should (recommended) write the machine (e.g. hostname) and software dependencies of software to new openPMD files now.
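A sketch of what this could look like on the writer side, assuming the attributes are stored under the names machine and softwareDependencies and that the file's attribute store behaves like a mutable mapping (a plain dict is used here; an h5py attrs object would be one possible real target). The function name write_provenance is hypothetical.

```python
def write_provenance(attrs, machine=None, software_dependencies=None):
    """Set the two new attributes on a mapping-like attribute store.

    Values that are missing or empty are skipped, so nothing is written
    when the information is not available.
    """
    if machine:
        attrs["machine"] = machine
    if software_dependencies:
        # Join individual entries with ';' as in the examples above.
        attrs["softwareDependencies"] = ";".join(software_dependencies)
    return attrs
```

For instance, write_provenance({}, "titan-login3", ["boost@1.62.0", "gcc@5.4.0"]) returns a mapping with both attributes set, while write_provenance({}) leaves it empty.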

We should also update our example files to include the two new attributes:

Reader Changes

No effects besides additional information that can be read.
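A reader that wants to use the additional information could split the string back into entries. The sketch below assumes a semicolon-separated softwareDependencies attribute with optional @version suffixes, and a mapping-like attribute store; the function name read_dependencies is hypothetical. Since the format is informational, entries that do not follow name@version simply come back with no version.

```python
def read_dependencies(attrs):
    """Parse 'softwareDependencies' into a list of (name, version) pairs.

    Returns an empty list when the attribute is absent, which keeps
    older files without the attribute readable.
    """
    raw = attrs.get("softwareDependencies", "")
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")  # some backends return byte strings
    result = []
    for entry in raw.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        name, _, version = entry.partition("@")
        result.append((name, version or None))
    return result
```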

Data Converter

No effect; old files are forward compatible with this change.

@ax3l ax3l added the "minor change" (backwards-compatible change) and "needs decision" (solutions emerged but we did not decide on the final solution yet) labels on Jan 23, 2018
@ax3l
Member Author

ax3l commented Jan 23, 2018

Discussion result today: we will keep it optional for now, just to keep the burden low because collecting machine and dependency information can be cumbersome.

When we have concrete examples of how to automate this in various programming languages and build systems, we can reconsider making it recommended.

I will update the PR soon and make it "work in progress" (WIP) for now - so please do not merge yet.

@ax3l ax3l changed the title from "Base Attributes: Software Deps & Machine" to "[WIP] Base Attributes: Software Deps & Machine" on Jan 23, 2018
@ax3l ax3l removed the "needs decision" (solutions emerged but we did not decide on the final solution yet) label on Jan 23, 2018
@RemiLehe
Member

Actually, would it be fine to postpone this for 2.0 (or even later)?
This is mainly because I think it is quite complicated to include all the relevant information (e.g. OS version, topology of the MPI network, CPU and GPU model, etc.) in the two strings that are added. In particular, this would require some kind of standardized format for these strings, which is probably a whole project in itself. What do you think?

Collaborator

@DavidSagan DavidSagan left a comment


I would make this neither required nor recommended, for a number of reasons: 1) To be useful, the writer program will have to collect the necessary information in some automated way, and I do not see how to do this. 2) I suspect that this information will only rarely be useful. 3) What information to include is rather vague.

@t184256

t184256 commented Jan 25, 2018

I think this option would have quite a lot of value at no implementor cost, provided it is optional and handled by an openPMD helper library, so that one only remembers its existence once a bug-hunting session ensues. And yeah, the premise is valid: the input and conditions needed for recreation are something that implementors eventually want to add to the output.

For pythonic-software, the semicolon-joined output of pip freeze or conda list --export would be ideal for the attribute softwareDependencies.

'Ideal' is a strong word, eh. A script to run, a complete set of input data and, dunno, a Nix expression for a fully reproducible environment would be what I call 'ideal' (and, of course, unattainable). Let's stop using 'ideal' early on.

pip freeze spews a lot of unneeded stuff. Wouldn't something like pipreqs output be better here? The downside is that it may miss out on some hidden imports.

For pythonic-software ... On HPC systems ... For CMake based builds ...

That's not mutually exclusive, like, at all. Does that mean that we want a whole tree of system attributes or something?

@ax3l
Member Author

ax3l commented Jan 25, 2018

@RemiLehe Since we make it optional and do not impose anything besides the format for <name>[@version] (not even that) I think we can already add it.

The topic of how to add something and what needs to be added when is a totally different one imho and cannot be answered in the scope of openPMD alone :)

@ax3l
Member Author

ax3l commented Jan 25, 2018

@RemiLehe @t184256 how we fill this with helper libraries would be a wonderful contribution that we can add to one of the openPMD projects. Currently, I would really like to reserve the keyword so people can try it in practice. There is no harm, it's purely informational and we will find various great solutions for various use cases.

For example, in PIConGPU we will just serialize our --version output, which shows all dependencies, and add it. For a highly documented and software-arched HPC system such as the ones at NERSC and ORNL, a hostname (titan-login3) plus the module list (boost@1.62.0;gcc@5.4.0 ...) can in some cases already be very useful for reproducibility. On my laptop (hostname: ax3l) this information might not be sufficient.

Reproducible software environments are a very domain- and application-specific topic that we, e.g., discuss regularly in Helmholtz Open Science. We should discuss the how and what there, so we do not go off-topic here.

Why I want this in openPMD is just to draw an informational connection so one can already pinpoint the major components that an app developer recognizes as central for the data creation. Not more and not less so far.

@ax3l ax3l changed the title from "[WIP] Base Attributes: Software Deps & Machine" to "Base Attributes: Software Deps & Machine" on Jan 25, 2018
@ax3l
Member Author

ax3l commented Jan 25, 2018

@RemiLehe I updated the PR with the discussed change to make the keyword optional and purely informational.

I hope the description above clarifies what it is intended for. I am interested in what people will use it for and what solutions will develop!

I also removed the restriction on a format besides "semicolon-separated list" so people can also decide to add container URIs or something :)

Implements openPMD#116 and openPMD#137 for reproducible data creation.

The individual software alone is not sufficient for proper
documentation and reproduction. We therefore reserve new attributes
for both dependencies of the `software` and hardware.
@RemiLehe
Member

OK, I guess if this is optional, it is fine. I'm fine with merging this.

@ax3l
Member Author

ax3l commented Jan 30, 2018

Just FYI: While updating the validator, I added to the example creation script a possible way to collect imported dependencies in Python: https://github.com/openPMD/openPMD-validator/pull/31/files

The function get_software_dependencies() generates e.g. the string python@2.7.13;numpy@1.13.3;hdf5@1.8.18;h5py@2.7.1
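For readers who do not want to follow the link, here is an independent sketch of the general idea, not the code from the validator PR: walk the modules that are currently imported and record those that expose a __version__ string. The function name imported_dependencies is hypothetical, and relying on __version__ is an assumption that misses packages which do not define it.

```python
import platform
import sys


def imported_dependencies():
    """Build a ';'-joined 'name@version' string from imported modules.

    Only top-level, non-private modules with a string `__version__`
    are listed; the Python version itself always comes first.
    """
    entries = [f"python@{platform.python_version()}"]
    # sorted() copies the keys, so imports triggered below are harmless.
    for name in sorted(sys.modules):
        if "." in name or name.startswith("_"):
            continue
        try:
            version = getattr(sys.modules[name], "__version__", None)
        except Exception:
            continue  # some modules raise on lazy attribute access
        if isinstance(version, str):
            entries.append(f"{name}@{version}")
    return ";".join(entries)
```

Compared with listing every installed package, this only reports what the running program has actually imported, though a few standard-library modules that define __version__ may also appear.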

@RemiLehe RemiLehe merged commit 7b257be into openPMD:upcoming-1.1.0 Jan 30, 2018
@ax3l ax3l deleted the topic-dependenciesMachine branch January 30, 2018 16:07
@ax3l ax3l mentioned this pull request Feb 6, 2018
Labels
minor change (backwards-compatible change)

4 participants