Use multiformat reader #74

lukaspie · 2024-07-23T12:04:34Z

Hi @domna, I have I started to implement the multi format reader (first focussing on the phi parser).
There are multiple things that are currently working nicely already.

file parsing
filling template from @attrs -> had to implement a special logic so that I can keep working with my existing notation
filling template from @data -> works for the average and raw_data keys, which write the entry/data/data and .../detector/raw_data/raw fields. Also works for any axisnames that are stored as coordinates of my data xarray.I write the individual scans also as DATA fields to NXdata using the get_data_dims function.

Then, I have some questions that are probably already possible for the existing solution in FAIRmat-NFDI/pynxtools#250. Would be thankful for your input on these.

filling template from @eln -> previously, I was simply writing @eln in the config file because for a given config key like /ENTRY[entry]/INSTRUMENT[instrument]/beamTYPE[beam_xray]/distance, there is a corresponding /ENTRY[entry]/INSTRUMENT[instrument]/beamTYPE[beam_xray]/distance key in the ELN dict after parsing (using the convert_dict). Thus, I did not need to give any path in the config. However, the get_eln_data requires a path str and if I simply write @eln, it doesn't work. I would probably need to write the whole path (incl. upper- and lowercase notation), which is quite cumbersome.
EDIT: works now as config keys are passed to get_* methods
Allow lists as config value to provide different options depending if the data exists in the specified position. As an example, for ["@attrs:title", "@eln, "no title"], the converter should check (in order) the @attrs and @eln tokens and write the respective value if it exists there. If not, it defaults to "no title". However, not it writes this list ["@attrs:title", "@eln, "no title"] into the field.
EDIT: works now through Allow lists of possible values in config file pynxtools#390

Finally, there are some things that I have no idea how to implement with the current solution:

multiple detectors, analyser, etc. -> for this to work, I would need to replace these in the keys in the parse_json_config method (like for the entries: key = key.replace("/ENTRY[entry]/", f"/ENTRY[{entry_name}]/")). However, this is currently not possible. A workaround would be if we replaced them in the config file before we fill from the template (i.e. in the post_process function). I have implemented something here, but then we need to change the read function in the MultiFormatReader like I implemented here.
EDIT: works as config dict is exposed to post processing
In my ELN file, I would like to give the possibility to have some metadata for all entries and then some only for specific entries. I can do this now using some special logic in the eln file parsing, but I am not sure that is even possible with the NOMAD ELNs. Just wanted to get your opinion on this if it is worth working in this direction.
EDIT: works, but only for XPS reader

And then, finally finally, there could be some additional functionalities added. In the XpsReader, I check that any unit in the final template is a valid pint unit, but for this I need to iterate the whole array. We could implement this earlier and make it a generic functionality in the multi format reader as well.

implemented only for the XPS reader

Sorry for the long text, but these are all the issues I encountered.

domna · 2024-07-23T14:24:50Z

filling template from @ELN -> previously, I was simply writing @ELN in the config file because for a given config key like /ENTRY[entry]/INSTRUMENT[instrument]/beamTYPE[beam_xray]/distance, there is a corresponding /ENTRY[entry]/INSTRUMENT[instrument]/beamTYPE[beam_xray]/distance key in the ELN dict after parsing (using the convert_dict). Thus, I did not need to give any path in the config. However, the get_eln_data requires a path str and if I simply write @ELN, it doesn't work. I would probably need to write the whole path (incl. upper- and lowercase notation), which is quite cumbersome.

I see. We could also pass the config dict key down to the get_... functions. I first did it but removed it later because I had the feeling that it's not really used. But we can surely re-add it.

Allow lists as config value to provide different options depending if the data exists in the specified position. As an example, for ["@attrs:title", "@ELN, "no title"], the converter should check (in order) the @attrs and @ELN tokens and write the respective value if it exists there. If not, it defaults to "no title". However, not it writes this list ["@attrs:title", "@ELN, "no title"] into the field.

I implemented the notation differently. You need to write @attrs,@eln,"no title:title. It might be a bit weird now that I think about it, but at least there is a mechanism for it which we can tweak to our liking.

multiple detectors, analyser, etc. -> for this to work, I would need to replace these in the keys in the parse_json_config method (like for the entries: key = key.replace("/ENTRY[entry]/", f"/ENTRY[{entry_name}]/")). However, this is currently not possible. A workaround would be if we replaced them in the config file before we fill from the template (i.e. in the post_process function). I have implemented something here, but then we need to change the read function in the MultiFormatReader like I implemented FAIRmat-NFDI/pynxtools#388

Could something like this work: https://github.com/FAIRmat-NFDI/pynxtools-mpes/blob/25983bae61f04ad379d77fd5b335fe4759ba4766/tests/data/config_file.json#L113

I like the approach of exposing the config dict for manipulation and being able to do some post-processing on it. So let's go for your approach.

In my ELN file, I would like to give the possibility to have some metadata for all entries and then some only for specific entries. I can do this now using some special logic in the eln file parsing, but I am not sure that is even possible with the NOMAD ELNs. Just wanted to get your opinion on this if it is worth working in this direction.

One possibility I also thought about also notation wise is that we actually not write /ENTRY[entry]/... and replace it but use /ENTRY/... to denote that this is a generic concept which will be filled by the reader. This also allows that we can actually write something specific to an entry, e.g., called my_entry with /ENTRY[my_entry]/... (or give the possibility there to say only to entry number 1 or so).

But I'm not sure if this is the use case you are aiming at.

domna · 2024-07-23T14:32:08Z

multiple detectors, analyser, etc. -> for this to work, I would need to replace these in the keys in the parse_json_config method (like for the entries: key = key.replace("/ENTRY[entry]/", f"/ENTRY[{entry_name}]/")). However, this is currently not possible. A workaround would be if we replaced them in the config file before we fill from the template (i.e. in the post_process function). I have implemented something here, but then we need to change the read function in the MultiFormatReader like I implemented FAIRmat-NFDI/pynxtools#388

Btw. there is also the *{name1, name2} notation, which we use here. This creates multiple fields with the respective names enlisted. If there is a * in any of the sub-field values it replaces those, too. This could be helpful if you have a fixed set of detector names.

lukaspie · 2024-07-23T15:02:31Z

filling template from @ELN -> previously, I was simply writing @ELN in the config file because for a given config key like /ENTRY[entry]/INSTRUMENT[instrument]/beamTYPE[beam_xray]/distance, there is a corresponding /ENTRY[entry]/INSTRUMENT[instrument]/beamTYPE[beam_xray]/distance key in the ELN dict after parsing (using the convert_dict). Thus, I did not need to give any path in the config. However, the get_eln_data requires a path str and if I simply write @ELN, it doesn't work. I would probably need to write the whole path (incl. upper- and lowercase notation), which is quite cumbersome.

I see. We could also pass the config dict key down to the get_... functions. I first did it but removed it later because I had the feeling that it's not really used. But we can surely re-add it.

Yeah, I think that would be quite helpful, also for some of the other callbacks (like in the get_data_dims, I want to decide which dims to use based on whether I am writing the raw detector data or the final NXdata).

Allow lists as config value to provide different options depending if the data exists in the specified position. As an example, for ["@attrs:title", "@ELN, "no title"], the converter should check (in order) the @attrs and @ELN tokens and write the respective value if it exists there. If not, it defaults to "no title". However, not it writes this list ["@attrs:title", "@ELN, "no title"] into the field.

I implemented the notation differently. You need to write @attrs,@eln,"no title:title. It might be a bit weird now that I think about it, but at least there is a mechanism for it which we can tweak to our liking.

Can you spell this out again? There may be an error in what you wrote. I tried "/ENTRY[entry]/title":"@attrs:title,@eln,no title", but that did not work.

multiple detectors, analyser, etc. -> for this to work, I would need to replace these in the keys in the parse_json_config method (like for the entries: key = key.replace("/ENTRY[entry]/", f"/ENTRY[{entry_name}]/")). However, this is currently not possible. A workaround would be if we replaced them in the config file before we fill from the template (i.e. in the post_process function). I have implemented something here, but then we need to change the read function in the MultiFormatReader like I implemented FAIRmat-NFDI/pynxtools#388

Could something like this work: https://github.com/FAIRmat-NFDI/pynxtools-mpes/blob/25983bae61f04ad379d77fd5b335fe4759ba4766/tests/data/config_file.json#L113

I like the approach of exposing the config dict for manipulation and being able to do some post-processing on it. So let's go for your approach.

The problem for me is that depending on the measurement settings, more or less detectors may be used, which I can infer from the actual data. So I need to dynamically replace them and cannot use the *{name1, name2} notation because I don't know name1, name2 beforehand. Thus, my idea was to actually make copies of the existing fields in the config, with the detector names replaced.
For example, if I have in my config file

"/ENTRY[entry]/INSTRUMENT[instrument]/DETECTOR[detector]/name":"@attrs:detector/name"
"/ENTRY[entry]/INSTRUMENT[instrument]/DETECTOR[detector]/name_link":"@link:/entry/instrument/detector/name",

the process_multiple_entities method would dynamically populate the config dict with

"/ENTRY[entry]/INSTRUMENT[instrument]/DETECTOR[my_first_detector]/name":"@attrs:my_first_detector/name",
"/ENTRY[entry]/INSTRUMENT[instrument]/DETECTOR[my_first_detector]/name_link":"@link:/entry/instrument/my_first_detector/name",
"/ENTRY[entry]/INSTRUMENT[instrument]/DETECTOR[my_second_detector]/name":"@attrs:my_second_detector/name",
"/ENTRY[entry]/INSTRUMENT[instrument]/DETECTOR[my_second_detector]/name_link":"@link:/entry/instrument/my_second_detector/name",

and remove the keys with the generic detector from the dict.

For the XPS reader, this works (after merging my PR in pynxtools). We can think if we want a generic solution for this in the MultiFormatReader or if each reader shall implement this itself.

In my ELN file, I would like to give the possibility to have some metadata for all entries and then some only for specific entries. I can do this now using some special logic in the eln file parsing, but I am not sure that is even possible with the NOMAD ELNs. Just wanted to get your opinion on this if it is worth working in this direction.

One possibility I also thought about also notation wise is that we actually not write /ENTRY[entry]/... and replace it but use /ENTRY/... to denote that this is a generic concept which will be filled by the reader. This also allows that we can actually write something specific to an entry, e.g., called my_entry with /ENTRY[my_entry]/... (or give the possibility there to say only to entry number 1 or so).

But I'm not sure if this is the use case you are aiming at.

My plan was to do it like here, have some generic ELN metadata and then some metadata specific to entry Su1s.

Do you want this /ENTRY/... notation for all renameable field in the config file? I like it, but then the reader must also give the analyser, collectioncolumn, etc. names.

domna · 2024-07-23T17:30:18Z

Yeah, I think that would be quite helpful, also for some of the other callbacks (like in the get_data_dims, I want to decide which dims to use based on whether I am writing the raw detector data or the final NXdata).

This can be easily added. Edit: It's added in the multiformatreader PR.

Can you spell this out again? There may be an error in what you wrote. I tried "/ENTRY[entry]/title":"@attrs:title,@ELN,no title", but that did not work.

If I understand my code correctly it expects something like "/ENTRY[entry]/title":"@attrs,@eln,no title:title", were it would pass the path title to the get_attrs and get_eln functions (basically it only allows a single value for all keywords). But your notation probably makes more sense here (+ is more flexible).

For the XPS reader, this works (after merging my PR in pynxtools). We can think if we want a generic solution for this in the MultiFormatReader or if each reader shall implement this itself.

I think it's fine if this is a specialised version. The MultiFormatReader is rather an aid tool to quickly build a reader for different formats. I think what we do in mpes and xps is comparably involved already (and we mainly want the config dict standardisation there).

My plan was to do it like here, have some generic ELN metadata and then some metadata specific to entry Su1s.

Do you want this /ENTRY/... notation for all renameable field in the config file? I like it, but then the reader must also give the analyser, collectioncolumn, etc. names.

This should work with /ENTRY/ and /ENTRY[Su1s]/ then. I would restrict it to ENTRY (for now at least), because we only keep track of the entry names in the reader and not the other names. Edit: I did the relevant change. So you probably need to change your config file to get all entries again.

domna · 2024-07-23T18:05:52Z

I also update the prefix-value parsing now and it should work as you tried it, e.g., @attrs:test,@eln,no title should pass test to get_attrs, '' to get_eln and otherwise writes no title. Whitespaces are added to the paths so they should be avoided, e.g., @attrs: test ,@eln would call get_attrs(key=...,value=" test ").

coveralls · 2024-07-25T13:28:31Z

Pull Request Test Coverage Report for Build 10354752021

Details

1 of 1 (100.0%) changed or added relevant line in 1 file are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 100.0%

Totals
Change from base Build 9976726494:	0.0%
Covered Lines:	48
Relevant Lines:	48

💛 - Coveralls

domna

Looks all good. Found a few parts which can be made simpler. But nothing as big as blocking this PR.

I did not look into all details, but I think it's well tested now anyways :)

src/pynxtools_xps/reader.py

pyproject.toml

…tools-xps into use-multiformat-reader

lukaspie added 4 commits July 18, 2024 13:27

simplify entries in mapper output

08aa71a

change to entry generation in mapper output

881bc66

renaming in phi parser

a490e9c

write data through multireader

0f9f7f8

lukaspie added 12 commits July 24, 2024 16:31

update phi reader, config, examples

a973e06

implement data returns in reader

42908fe

update scienta reader, config, eln file

16a83d0

improve atom types parsing from ELN

908fa42

prepare vms reader

5108918

prepare vms txt export reader

f34e630

prepare specs sle reader

cce01f6

prepare specs xy reader

4583db2

prepare specs xml reader

c16714f

small adjustments to example/test ELN data and configs

073796a

regenerate examples and test data

6d622cd

use all test cases

37cd63a

lukaspie added 2 commits July 25, 2024 14:09

avoid using sets in get_entry_names

de93f2a

mypy fixes

21fe19c

lukaspie mentioned this pull request Jul 25, 2024

Adds multi format reader FAIRmat-NFDI/pynxtools#250

Merged

2 tasks

lukaspie added 2 commits July 26, 2024 09:40

use parent_key in yaml parsing

ce3e55b

only output logger.info

0fdff9a

lukaspie marked this pull request as ready for review July 26, 2024 09:57

lukaspie requested a review from domna July 26, 2024 09:57

bump ruff to v0.5.5

f775297

domna approved these changes Jul 26, 2024

View reviewed changes

src/pynxtools_xps/reader.py Outdated Show resolved Hide resolved

src/pynxtools_xps/reader.py Outdated Show resolved Hide resolved

src/pynxtools_xps/reader.py Outdated Show resolved Hide resolved

src/pynxtools_xps/reader.py Show resolved Hide resolved

lukaspie added 2 commits July 26, 2024 13:54

code simplications

05c9800

add todos for non-used MultiFormatReader methods

235285c

lukaspie commented Jul 26, 2024

View reviewed changes

pyproject.toml Outdated Show resolved Hide resolved

lukaspie added 5 commits August 12, 2024 11:49

remove optional keys from default configs

f362043

update test data

b46cdd2

update pynxtools to v0.6

864a81a

Merge branch 'use-multiformat-reader' of github.com:FAIRmat-NFDI/pynx…

594ce8e

…tools-xps into use-multiformat-reader

ruff fix

a4e77d7

lukaspie merged commit 266d7ac into main Aug 12, 2024
7 checks passed

lukaspie deleted the use-multiformat-reader branch August 12, 2024 15:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use multiformat reader #74

Use multiformat reader #74

lukaspie commented Jul 23, 2024 •

edited

Loading

domna commented Jul 23, 2024

domna commented Jul 23, 2024

lukaspie commented Jul 23, 2024 •

edited

Loading

domna commented Jul 23, 2024 •

edited

Loading

domna commented Jul 23, 2024

coveralls commented Jul 25, 2024 •

edited

Loading

domna left a comment

Use multiformat reader #74

Use multiformat reader #74

Conversation

lukaspie commented Jul 23, 2024 • edited Loading

domna commented Jul 23, 2024

domna commented Jul 23, 2024

lukaspie commented Jul 23, 2024 • edited Loading

domna commented Jul 23, 2024 • edited Loading

domna commented Jul 23, 2024

coveralls commented Jul 25, 2024 • edited Loading

Pull Request Test Coverage Report for Build 10354752021

Details

💛 - Coveralls

domna left a comment

Choose a reason for hiding this comment

lukaspie commented Jul 23, 2024 •

edited

Loading

lukaspie commented Jul 23, 2024 •

edited

Loading

domna commented Jul 23, 2024 •

edited

Loading

coveralls commented Jul 25, 2024 •

edited

Loading