Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use multiformat reader #74

Merged
merged 28 commits into from
Aug 12, 2024
Merged

Use multiformat reader #74

merged 28 commits into from
Aug 12, 2024

Conversation

lukaspie
Copy link
Collaborator

@lukaspie lukaspie commented Jul 23, 2024

Hi @domna, I have I started to implement the multi format reader (first focussing on the phi parser).
There are multiple things that are currently working nicely already.

  • file parsing
  • filling template from @attrs -> had to implement a special logic so that I can keep working with my existing notation
  • filling template from @data -> works for the average and raw_data keys, which write the entry/data/data and .../detector/raw_data/raw fields. Also works for any axisnames that are stored as coordinates of my data xarray.I write the individual scans also as DATA fields to NXdata using the get_data_dims function.

Then, I have some questions that are probably already possible for the existing solution in FAIRmat-NFDI/pynxtools#250. Would be thankful for your input on these.

  • filling template from @eln -> previously, I was simply writing @eln in the config file because for a given config key like /ENTRY[entry]/INSTRUMENT[instrument]/beamTYPE[beam_xray]/distance, there is a corresponding /ENTRY[entry]/INSTRUMENT[instrument]/beamTYPE[beam_xray]/distance key in the ELN dict after parsing (using the convert_dict). Thus, I did not need to give any path in the config. However, the get_eln_data requires a path str and if I simply write @eln, it doesn't work. I would probably need to write the whole path (incl. upper- and lowercase notation), which is quite cumbersome.
    EDIT: works now as config keys are passed to get_* methods

  • Allow lists as config value to provide different options depending if the data exists in the specified position. As an example, for ["@attrs:title", "@eln, "no title"], the converter should check (in order) the @attrs and @eln tokens and write the respective value if it exists there. If not, it defaults to "no title". However, not it writes this list ["@attrs:title", "@eln, "no title"] into the field.
    EDIT: works now through Allow lists of possible values in config file pynxtools#390

Finally, there are some things that I have no idea how to implement with the current solution:

  • multiple detectors, analyser, etc. -> for this to work, I would need to replace these in the keys in the parse_json_config method (like for the entries: key = key.replace("/ENTRY[entry]/", f"/ENTRY[{entry_name}]/")). However, this is currently not possible. A workaround would be if we replaced them in the config file before we fill from the template (i.e. in the post_process function). I have implemented something here, but then we need to change the read function in the MultiFormatReader like I implemented here.
    EDIT: works as config dict is exposed to post processing

  • In my ELN file, I would like to give the possibility to have some metadata for all entries and then some only for specific entries. I can do this now using some special logic in the eln file parsing, but I am not sure that is even possible with the NOMAD ELNs. Just wanted to get your opinion on this if it is worth working in this direction.
    EDIT: works, but only for XPS reader

And then, finally finally, there could be some additional functionalities added. In the XpsReader, I check that any unit in the final template is a valid pint unit, but for this I need to iterate the whole array. We could implement this earlier and make it a generic functionality in the multi format reader as well.

  • implemented only for the XPS reader

Sorry for the long text, but these are all the issues I encountered.

@domna
Copy link
Collaborator

domna commented Jul 23, 2024

filling template from @ELN -> previously, I was simply writing @ELN in the config file because for a given config key like /ENTRY[entry]/INSTRUMENT[instrument]/beamTYPE[beam_xray]/distance, there is a corresponding /ENTRY[entry]/INSTRUMENT[instrument]/beamTYPE[beam_xray]/distance key in the ELN dict after parsing (using the convert_dict). Thus, I did not need to give any path in the config. However, the get_eln_data requires a path str and if I simply write @ELN, it doesn't work. I would probably need to write the whole path (incl. upper- and lowercase notation), which is quite cumbersome.

I see. We could also pass the config dict key down to the get_... functions. I first did it but removed it later because I had the feeling that it's not really used. But we can surely re-add it.

Allow lists as config value to provide different options depending if the data exists in the specified position. As an example, for ["@attrs:title", "@ELN, "no title"], the converter should check (in order) the @attrs and @ELN tokens and write the respective value if it exists there. If not, it defaults to "no title". However, not it writes this list ["@attrs:title", "@ELN, "no title"] into the field.

I implemented the notation differently. You need to write @attrs,@eln,"no title:title. It might be a bit weird now that I think about it, but at least there is a mechanism for it which we can tweak to our liking.

multiple detectors, analyser, etc. -> for this to work, I would need to replace these in the keys in the parse_json_config method (like for the entries: key = key.replace("/ENTRY[entry]/", f"/ENTRY[{entry_name}]/")). However, this is currently not possible. A workaround would be if we replaced them in the config file before we fill from the template (i.e. in the post_process function). I have implemented something here, but then we need to change the read function in the MultiFormatReader like I implemented FAIRmat-NFDI/pynxtools#388

Could something like this work: https://github.com/FAIRmat-NFDI/pynxtools-mpes/blob/25983bae61f04ad379d77fd5b335fe4759ba4766/tests/data/config_file.json#L113

I like the approach of exposing the config dict for manipulation and being able to do some post-processing on it. So let's go for your approach.

In my ELN file, I would like to give the possibility to have some metadata for all entries and then some only for specific entries. I can do this now using some special logic in the eln file parsing, but I am not sure that is even possible with the NOMAD ELNs. Just wanted to get your opinion on this if it is worth working in this direction.

One possibility I also thought about also notation wise is that we actually not write /ENTRY[entry]/... and replace it but use /ENTRY/... to denote that this is a generic concept which will be filled by the reader. This also allows that we can actually write something specific to an entry, e.g., called my_entry with /ENTRY[my_entry]/... (or give the possibility there to say only to entry number 1 or so).

But I'm not sure if this is the use case you are aiming at.

@domna
Copy link
Collaborator

domna commented Jul 23, 2024

multiple detectors, analyser, etc. -> for this to work, I would need to replace these in the keys in the parse_json_config method (like for the entries: key = key.replace("/ENTRY[entry]/", f"/ENTRY[{entry_name}]/")). However, this is currently not possible. A workaround would be if we replaced them in the config file before we fill from the template (i.e. in the post_process function). I have implemented something here, but then we need to change the read function in the MultiFormatReader like I implemented FAIRmat-NFDI/pynxtools#388

Btw. there is also the *{name1, name2} notation, which we use here. This creates multiple fields with the respective names enlisted. If there is a * in any of the sub-field values it replaces those, too. This could be helpful if you have a fixed set of detector names.

@lukaspie
Copy link
Collaborator Author

lukaspie commented Jul 23, 2024

filling template from @ELN -> previously, I was simply writing @ELN in the config file because for a given config key like /ENTRY[entry]/INSTRUMENT[instrument]/beamTYPE[beam_xray]/distance, there is a corresponding /ENTRY[entry]/INSTRUMENT[instrument]/beamTYPE[beam_xray]/distance key in the ELN dict after parsing (using the convert_dict). Thus, I did not need to give any path in the config. However, the get_eln_data requires a path str and if I simply write @ELN, it doesn't work. I would probably need to write the whole path (incl. upper- and lowercase notation), which is quite cumbersome.

I see. We could also pass the config dict key down to the get_... functions. I first did it but removed it later because I had the feeling that it's not really used. But we can surely re-add it.

Yeah, I think that would be quite helpful, also for some of the other callbacks (like in the get_data_dims, I want to decide which dims to use based on whether I am writing the raw detector data or the final NXdata).

Allow lists as config value to provide different options depending if the data exists in the specified position. As an example, for ["@attrs:title", "@ELN, "no title"], the converter should check (in order) the @attrs and @ELN tokens and write the respective value if it exists there. If not, it defaults to "no title". However, not it writes this list ["@attrs:title", "@ELN, "no title"] into the field.

I implemented the notation differently. You need to write @attrs,@eln,"no title:title. It might be a bit weird now that I think about it, but at least there is a mechanism for it which we can tweak to our liking.

Can you spell this out again? There may be an error in what you wrote. I tried "/ENTRY[entry]/title":"@attrs:title,@eln,no title", but that did not work.

multiple detectors, analyser, etc. -> for this to work, I would need to replace these in the keys in the parse_json_config method (like for the entries: key = key.replace("/ENTRY[entry]/", f"/ENTRY[{entry_name}]/")). However, this is currently not possible. A workaround would be if we replaced them in the config file before we fill from the template (i.e. in the post_process function). I have implemented something here, but then we need to change the read function in the MultiFormatReader like I implemented FAIRmat-NFDI/pynxtools#388

Could something like this work: https://github.com/FAIRmat-NFDI/pynxtools-mpes/blob/25983bae61f04ad379d77fd5b335fe4759ba4766/tests/data/config_file.json#L113

I like the approach of exposing the config dict for manipulation and being able to do some post-processing on it. So let's go for your approach.

The problem for me is that depending on the measurement settings, more or less detectors may be used, which I can infer from the actual data. So I need to dynamically replace them and cannot use the *{name1, name2} notation because I don't know name1, name2 beforehand. Thus, my idea was to actually make copies of the existing fields in the config, with the detector names replaced.
For example, if I have in my config file

"/ENTRY[entry]/INSTRUMENT[instrument]/DETECTOR[detector]/name":"@attrs:detector/name"
"/ENTRY[entry]/INSTRUMENT[instrument]/DETECTOR[detector]/name_link":"@link:/entry/instrument/detector/name",

the process_multiple_entities method would dynamically populate the config dict with

"/ENTRY[entry]/INSTRUMENT[instrument]/DETECTOR[my_first_detector]/name":"@attrs:my_first_detector/name",
"/ENTRY[entry]/INSTRUMENT[instrument]/DETECTOR[my_first_detector]/name_link":"@link:/entry/instrument/my_first_detector/name",
"/ENTRY[entry]/INSTRUMENT[instrument]/DETECTOR[my_second_detector]/name":"@attrs:my_second_detector/name",
"/ENTRY[entry]/INSTRUMENT[instrument]/DETECTOR[my_second_detector]/name_link":"@link:/entry/instrument/my_second_detector/name",

and remove the keys with the generic detector from the dict.

For the XPS reader, this works (after merging my PR in pynxtools). We can think if we want a generic solution for this in the MultiFormatReader or if each reader shall implement this itself.

In my ELN file, I would like to give the possibility to have some metadata for all entries and then some only for specific entries. I can do this now using some special logic in the eln file parsing, but I am not sure that is even possible with the NOMAD ELNs. Just wanted to get your opinion on this if it is worth working in this direction.

One possibility I also thought about also notation wise is that we actually not write /ENTRY[entry]/... and replace it but use /ENTRY/... to denote that this is a generic concept which will be filled by the reader. This also allows that we can actually write something specific to an entry, e.g., called my_entry with /ENTRY[my_entry]/... (or give the possibility there to say only to entry number 1 or so).

But I'm not sure if this is the use case you are aiming at.

My plan was to do it like here, have some generic ELN metadata and then some metadata specific to entry Su1s.

Do you want this /ENTRY/... notation for all renameable field in the config file? I like it, but then the reader must also give the analyser, collectioncolumn, etc. names.

@domna
Copy link
Collaborator

domna commented Jul 23, 2024

Yeah, I think that would be quite helpful, also for some of the other callbacks (like in the get_data_dims, I want to decide which dims to use based on whether I am writing the raw detector data or the final NXdata).

This can be easily added. Edit: It's added in the multiformatreader PR.

Can you spell this out again? There may be an error in what you wrote. I tried "/ENTRY[entry]/title":"@attrs:title,@ELN,no title", but that did not work.

If I understand my code correctly it expects something like "/ENTRY[entry]/title":"@attrs,@eln,no title:title", were it would pass the path title to the get_attrs and get_eln functions (basically it only allows a single value for all keywords). But your notation probably makes more sense here (+ is more flexible).

For the XPS reader, this works (after merging my PR in pynxtools). We can think if we want a generic solution for this in the MultiFormatReader or if each reader shall implement this itself.

I think it's fine if this is a specialised version. The MultiFormatReader is rather an aid tool to quickly build a reader for different formats. I think what we do in mpes and xps is comparably involved already (and we mainly want the config dict standardisation there).

My plan was to do it like here, have some generic ELN metadata and then some metadata specific to entry Su1s.

Do you want this /ENTRY/... notation for all renameable field in the config file? I like it, but then the reader must also give the analyser, collectioncolumn, etc. names.

This should work with /ENTRY/ and /ENTRY[Su1s]/ then. I would restrict it to ENTRY (for now at least), because we only keep track of the entry names in the reader and not the other names. Edit: I did the relevant change. So you probably need to change your config file to get all entries again.

@domna
Copy link
Collaborator

domna commented Jul 23, 2024

I also update the prefix-value parsing now and it should work as you tried it, e.g., @attrs:test,@eln,no title should pass test to get_attrs, '' to get_eln and otherwise writes no title. Whitespaces are added to the paths so they should be avoided, e.g., @attrs: test ,@eln would call get_attrs(key=...,value=" test ").

@coveralls
Copy link

coveralls commented Jul 25, 2024

Pull Request Test Coverage Report for Build 10354752021

Details

  • 1 of 1 (100.0%) changed or added relevant line in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 9976726494: 0.0%
Covered Lines: 48
Relevant Lines: 48

💛 - Coveralls

@lukaspie lukaspie marked this pull request as ready for review July 26, 2024 09:57
@lukaspie lukaspie requested a review from domna July 26, 2024 09:57
Copy link
Collaborator

@domna domna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks all good. Found a few parts which can be made simpler. But nothing as big as blocking this PR.

I did not look into all details, but I think it's well tested now anyways :)

src/pynxtools_xps/reader.py Outdated Show resolved Hide resolved
src/pynxtools_xps/reader.py Outdated Show resolved Hide resolved
src/pynxtools_xps/reader.py Outdated Show resolved Hide resolved
src/pynxtools_xps/reader.py Show resolved Hide resolved
pyproject.toml Outdated Show resolved Hide resolved
@lukaspie lukaspie merged commit 266d7ac into main Aug 12, 2024
7 checks passed
@lukaspie lukaspie deleted the use-multiformat-reader branch August 12, 2024 15:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants