
Use hdf5 or nexus file in XRD #113

Open · wants to merge 41 commits into base: main
Conversation

@ka-sarthak (Collaborator) commented Aug 28, 2024

When array data from XRD measurements is added to the archives, the loading time increases as the archives become heavier (especially in the case of RSM, which stores multiple 2D arrays). One solution is to offload the heavy data to an auxiliary file and save only references to it in the archives.

To implement this, we can use .h5 files to store the data and reference the offloaded datasets using HDF5Reference. Alternatively, we can generate a NeXus .nxs file instead of a plain .h5 file. NeXus uses HDF5 as its base file format and validates the data against data models developed by the NeXus community.

The current plots are generated using Plotly, and the .json files containing the plot data are also stored in the archive. These also need to be offloaded to make the archives lighter. Using NOMAD's H5WebAnnotations, we can leverage H5Web to generate plots directly from the .h5 or .nxs files.
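For context, H5Web follows the NeXus default-plot convention (an NXdata group with signal/axes attributes). A minimal h5py sketch of such annotations, with illustrative file, group, and dataset names:

```python
import h5py
import numpy as np

with h5py.File('xrd_plot.h5', 'w') as f:
    data = f.create_group('entry/plot')
    data.attrs['NX_class'] = 'NXdata'
    data.attrs['signal'] = 'intensity'   # dataset to plot on the y-axis
    data.attrs['axes'] = ['two_theta']   # dataset(s) to use as the x-axis
    data.create_dataset('two_theta', data=np.linspace(10, 90, 500))
    data.create_dataset('intensity', data=np.random.default_rng(0).random(500))
    # Default-plot hints so viewers land on this group automatically
    f.attrs['default'] = 'entry'
    f['entry'].attrs['default'] = 'plot'
```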

To this end, the following steps are needed:

  • Use HDF5Reference as the type of the Quantity for array data: intensity, two_theta, q_parallel, q_perpendicular, q_norm, omega, phi, chi (see the sketch after this list).
  • Implement a utility class HDF5Handler (or functions) to create auxiliary files from the normalizers of the schema.
  • Generate a .h5 file to store the data and save references to its datasets in HDF5Reference quantities.
  • Generate a .nxs file based on the archive. This happens in the HDF5Handler and uses pynxtools.
  • Add annotations in the auxiliary files to generate plots for the H5Web viewer.
  • Add backward compatibility.
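For the first step, a rough sketch of what an HDF5Reference-typed quantity could look like, assuming NOMAD's HDF5Reference type (from nomad.datamodel.hdf5) and its '/uploads/<upload_id>/raw/<file>#<dataset-path>' reference string format; the section layout here is illustrative, not the PR's exact schema:

```python
from nomad.datamodel.data import ArchiveSection
from nomad.datamodel.hdf5 import HDF5Reference
from nomad.metainfo import Quantity


class XRDResult(ArchiveSection):
    # The archive stores only a reference string such as
    # '/uploads/<upload_id>/raw/xrd.h5#/entry/data/intensity'
    # instead of the heavy array itself.
    intensity = Quantity(
        type=HDF5Reference,
        description='Detector intensity, offloaded to the auxiliary file.',
    )
    two_theta = Quantity(
        type=HDF5Reference,
        description='2θ angle values, offloaded to the auxiliary file.',
    )
```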

Summary by Sourcery

Implement support for storing XRD array data in external HDF5 or NeXus files, and generate plots using H5WebAnnotations.

New Features:

  • Visualize XRD data using H5Web plots.

Tests:

  • Updated tests to accommodate changes in data handling.

@ka-sarthak self-assigned this Aug 28, 2024
@ka-sarthak force-pushed the write-nexus-section branch 2 times, most recently from f2bef40 to d583974 on September 3, 2024
@ka-sarthak changed the title from "Use nexus section in XRD" to "Use hdf5/nexus file in XRD" on Dec 19, 2024
@ka-sarthak changed the title from "Use hdf5/nexus file in XRD" to "Use hdf5 or nexus file in XRD" on Dec 19, 2024
@ka-sarthak marked this pull request as draft on December 19, 2024
@ka-sarthak (Collaborator, Author) commented Dec 19, 2024

@hampusnasstrom @aalbino2 I merged the implementation of the HDF5Handler and support for .h5 as the auxiliary file format.

The Plotly plots are removed in favor of plots from H5Web. @budschi's current viewpoint is that the Plotly plots have better visualizations and it might be a good idea to preserve them for 1D scans. This can be a point of discussion when we review this PR after the vacation.

@RubelMozumder will soon merge his implementation from #147, which will allow using a .nxs file as an auxiliary file.

@ka-sarthak (Collaborator, Author)

@RubelMozumder I have combined the common functionality from walk_through_object and _set_hdf5_ref into one utility function, resolve_path.

@ka-sarthak (Collaborator, Author) commented Dec 20, 2024

TODO

  • Combine the mapping in nx.py, which is ingested by the Handler as an argument.
  • Try to overwrite the .nxs file without deleting the mainfile. As per @TLCFEM, we should avoid deleting the mainfile.

@TLCFEM commented Dec 20, 2024

Have you checked what the root cause of the issue is?
Is the file still occupied when it is read by something else?

@ka-sarthak (Collaborator, Author)

@TLCFEM I haven't been able to investigate it yet, but this will be among the first things I do in the new year, and I will reach out to you with my findings. Happy holidays!

@TLCFEM commented Dec 20, 2024

If that is not the case, then none of the discussion so far is valid anymore.
So check the access pattern first.
HDF5 has quite a few caveats and requires some knowledge of how things work internally.

@RubelMozumder (Contributor)

> If that is not the case, then none of the discussion so far is valid anymore. So check the access pattern first. HDF5 has quite a few caveats and requires some knowledge of how things work internally.

Let me explain the situation, which may lay bare the scenario.
We have an ELN that takes an input file. While processing the ELN object (archive.json), it generates an output file of type .h5 or .nxs. If it is a .nxs file, NOMAD indexes it as an entry.
So, in the first attempt at ELN processing, there is no error and all looks good.

Issue: On the second attempt, when reprocessing the entire upload (archive.json, .nxs, and so on), NOMAD processes both the archive.json and the .nxs (NOMAD entry). Reprocessing the archive.json also recreates the .nxs file, and that is where the issue arises. As far as I understand, two worker processes then operate on the same .nxs file object concurrently.

Temporary solution:
On each processing of the archive.json, we delete the .nxs file (NOMAD entry) if it exists and regenerate it. This might not be the right approach to handle this case.

@aalbino2 (Contributor) commented Jan 7, 2025

> (quoting @RubelMozumder's comment above)

@RubelMozumder, what prevents you from checking for the existence of the .nxs file and creating a new one only if it doesn't exist yet?

@RubelMozumder (Contributor)

After discussing with @TLCFEM, we found the following:

  • There is a resource contention issue, where multiple processes try to access the generated NeXus file in different modes (read and write). Generating the NeXus file is not the problem; rather, triggering a reprocess with m_context.process_updated_raw_file(filename, allow_modify=True) from the ELN normalizer can lead to resource contention, because a new worker is assigned to this reprocess in parallel to the worker handling the normalization. The ELN normalization worker might have the NeXus file open in write mode while the reprocess worker tries to open it in read mode to process the NeXus entry.
  • The behavior is unpredictable: sometimes the entry normalization happens without the resource contention error, and other times it does not.

Some directions for resolving this:

  • Use sleep timers in the NeXus processing triggered by the NeXus parser. This allows the ELN process to complete (and the file to be closed) before the processing of the NeXus entry is triggered. However, this isn't a real solution, as one can't know a timer value that fits all cases.
  • Delete the NeXus file, if it exists, before triggering the NeXus file writing from the ELN. This makes sure that no NeXus entry is being processed during the NeXus file writing process.
  • Do not trigger m_context.process_updated_raw_file(filename, allow_modify=True) from the normalizer. This avoids the resource contention situation entirely; instead, the user triggers a reprocess of the upload from the GUI. Drawback: user inconvenience.
  • Force the reprocess triggered by m_context.process_updated_raw_file(filename, allow_modify=True) to use the current process rather than creating a new worker. This can be done by using entry.process_entry_local() instead of entry.process_entry() (look here); see the sketch at the end of this comment.

This may resolve the read/write race condition on the same file, but there is another issue. Suppose the first raw-file processing succeeds and the NeXus writer creates a NeXus entry. On the second attempt, the NeXus process fails for some reason, but the entry from the first process is still there. In that case, we need to delete the NeXus file and its entry as well, and write an HDF5 file instead.

I think that needs a fix from area-D: deleting a (corrupted) entry and its related file from within the single process/thread running the normalizer.

PR #157 can help; there you can see that the test fails completely.
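To make the contention-prone pattern concrete, here is a rough sketch. The normalizer structure and the hdf5_handler/auxiliary_file attributes are assumptions based on this thread; only the process_updated_raw_file call is taken verbatim from the discussion:

```python
from nomad.datamodel.metainfo.basesections import Measurement


class ELNXRayDiffraction(Measurement):
    def normalize(self, archive, logger):
        super().normalize(archive, logger)
        # Regenerate the auxiliary .nxs file; this worker may still hold it.
        self.hdf5_handler.write_file()
        # Spawns a new worker to parse the regenerated .nxs entry. That worker
        # can open the file in read mode while this one holds it in write
        # mode, producing the resource contention described above.
        archive.m_context.process_updated_raw_file(
            self.auxiliary_file, allow_modify=True
        )
```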

@RubelMozumder (Contributor)

@lauri-codes, is there any functionality that deletes an entry, its associated mainfile, and any residue of that deleted entry (if there is something, e.g. ES data)? This deletion must happen inside the ELN normalization process.

Just a quick overview of the implementation:

try:
    # create a NeXus file, which ends up as a NeXus entry
    ...
except Exception:
    # delete the NeXus mainfile, the entry, and residual metadata,
    # then create an HDF5 file instead (an .h5 file is not a NOMAD entry)
    ...

Then we reference the concepts in the NeXus or HDF5 file from the entry quantities.

Currently, we are using os.remove to delete the mainfile (which we believe is not the correct way to do it), and even so, deleting the mainfile does not delete the entry and its metadata.

You may want to take a quick look at the code in the function write_file here:

I have created a small function to delete the mainfile, the entry, and the ES data (here):

def delete_entry_file(archive, mainfile, delete_entry=False):

This raises an error from a different process, and I cannot trace back from where in my code the error is coming. It also fails the ELN entry normalization process.

Could you please suggest any functionality that is available in NOMAD?

@lauri-codes

@RubelMozumder: There is no such functionality, and I doubt there ever will be. Deleting entries during processing is not something we can really endorse: there are too many ways to screw this up. (What happens if the entry is deleted and then an exception occurs before the new data is stored? What happens when some other processed entry tries to read the deleted entry simultaneously? What happens if the file is opened by another process holding a lock on it when someone tries to delete it?)

I would instead like to try and understand what goal you are trying to achieve with this normalizer. It is reasonable to create temporary files during normalization, and also reasonable to create new entries at the end of normalization (assuming there are no circular processing steps or parallel processes that might cause issues).

@ka-sarthak (Collaborator, Author)

First processing:

  • ELN normalization opens the NeXus file in write mode to generate it → the NeXus parser opens it in read mode to create the NeXus entry.

Reprocessing the upload:

  • ELN normalization opens the NeXus file in write mode while the NeXus entry tries to open it in read mode: resource contention.

One way to avoid this is to control access to the NeXus file with an "overwrite nexus file" switch (a BoolEditQuantity) in the ELN. In the first processing, the ELN generates the NeXus file and sets the switch to False. When reprocessing the upload, the ELN does not open the NeXus file in write mode because the switch is not set. When users want to update the NeXus file, they open the entry, set the switch, and reprocess the entry; this overwrites the NeXus file and then sets the switch back to False. In this scheme, only the ELN ever accesses the NeXus file in write mode, so there is no resource contention. A sketch follows below.
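A minimal sketch of such a switch, assuming NOMAD's ELNAnnotation API. The quantity name matches the overwrite_auxiliary_file quantity listed in the file-level changes below; the normalize body and the hdf5_handler attribute are illustrative:

```python
from nomad.datamodel.metainfo.annotations import ELNAnnotation, ELNComponentEnum
from nomad.datamodel.metainfo.basesections import Measurement
from nomad.metainfo import Quantity


class ELNXRayDiffraction(Measurement):
    overwrite_auxiliary_file = Quantity(
        type=bool,
        default=False,
        description='Set to regenerate the auxiliary file on the next save.',
        a_eln=ELNAnnotation(component=ELNComponentEnum.BoolEditQuantity),
    )

    def normalize(self, archive, logger):
        if self.overwrite_auxiliary_file:
            # Only the ELN touches the file; no parallel reader is triggered.
            self.hdf5_handler.write_file()
            self.overwrite_auxiliary_file = False  # flip the switch back
        super().normalize(archive, logger)
```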

@ka-sarthak (Collaborator, Author)

The above solution does not work as intended due to the following issue: https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/merge_requests/2301

@ka-sarthak (Collaborator, Author)

The changes made here are backward compatible. However, Oasis admins must reprocess all ELNXRayDiffraction entries.

@ka-sarthak marked this pull request as ready for review on January 23, 2025
@ka-sarthak added the "enhancement" (New feature or request) label on Jan 23, 2025
sourcery-ai bot commented Jan 23, 2025

Reviewer's Guide by Sourcery

This pull request introduces the use of HDF5 or NeXus files to store array data from XRD measurements, reducing archive size and loading time. It uses HDF5Reference to refer to datasets in the auxiliary files, adds H5Web annotations for generating plots, and ensures backward compatibility.

Sequence diagram for XRD data processing with HDF5/Nexus storage

sequenceDiagram
    participant User
    participant XRD as XRDMeasurement
    participant Handler as HDF5Handler
    participant Storage as HDF5/Nexus File
    participant Archive as NOMAD Archive

    User->>XRD: Upload XRD data
    XRD->>Handler: Create HDF5Handler
    Handler->>Handler: add_dataset()
    Handler->>Handler: add_attribute()
    Handler->>Storage: write_file()
    Note over Handler,Storage: Creates .h5 or .nxs file
    Handler->>Archive: set_hdf5_references()
    Note over Handler,Archive: Updates archive with references
    Archive-->>User: Return processed data

Class diagram for the updated XRD data handling

classDiagram
    class HDF5Handler {
        +data_file: str
        +archive: EntryArchive
        +logger: BoundLogger
        +nexus: bool
        +add_dataset()
        +add_attribute()
        +read_dataset()
        +write_file()
        -_write_nx_file()
        -_write_hdf5_file()
        +set_hdf5_references()
    }

    class XRDResult {
        +intensity: HDF5Reference
        +two_theta: HDF5Reference
        +q_norm: HDF5Reference
        +omega: HDF5Reference
        +phi: HDF5Reference
        +chi: HDF5Reference
        +plot_intensity: XRDResultPlotIntensity
        +plot_intensity_scattering_vector: XRDResultPlotIntensityScatteringVector
    }

    class XRDResultPlotIntensity {
        +intensity: HDF5Reference
        +two_theta: HDF5Reference
        +omega: HDF5Reference
        +phi: HDF5Reference
        +chi: HDF5Reference
        +normalize()
    }

    XRDResult --> XRDResultPlotIntensity
    XRDResult --> XRDResultPlotIntensityScatteringVector
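Based on the methods listed in the class diagram above, a hypothetical usage sketch from inside a normalizer; argument names and dataset paths are assumptions, and archive, logger, and the arrays come from the surrounding context:

```python
handler = HDF5Handler(
    data_file='measurement.nxs',  # '.nxs' for NeXus, '.h5' for plain HDF5
    archive=archive,
    logger=logger,
)
handler.add_dataset('/entry/experiment_result/two_theta', two_theta)
handler.add_dataset('/entry/experiment_result/intensity', intensity)
handler.add_attribute('/entry/experiment_result', {'signal': 'intensity'})
handler.write_file()  # writes the .nxs file (via pynxtools) or the .h5 file
handler.set_hdf5_references(archive)
```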

File-Level Changes

Change: Utilize HDF5 files to store array data and save references to datasets in HDF5Reference quantities.
Files: src/nomad_measurements/xrd/schema.py
Details:
  • Modified Quantity types for array data (intensity, two_theta, q_parallel, q_perpendicular, q_norm, omega, phi, chi) to HDF5Reference.
  • Added XRDResultPlotIntensity and XRDResultPlotIntensityScatteringVector sections for plotting intensity over 2-theta and the scattering vector, respectively.
  • Implemented the normalize methods for the plot sections to add datasets and attributes to the HDF5 file for the H5Web viewer.
  • Modified XRDResult1D and XRDResultRSM to read data from HDF5 files.
  • Added auxiliary_file and overwrite_auxiliary_file quantities to the ELNXRayDiffraction section.
  • Modified ELNXRayDiffraction to create and use HDF5Handler for writing data to HDF5 files.
  • Implemented backward compatibility by removing existing results and figures.

Change: Implement the utility class HDF5Handler to create auxiliary files from the normalizers of the schema.
Files: src/nomad_measurements/utils.py
Details:
  • Implemented the HDF5Handler class with methods for adding datasets and attributes, reading datasets, and writing the HDF5 file.
  • Added methods to handle pint.Quantity and set HDF5 references.
  • Added methods to create the NeXus file.
  • Added helper functions for resolving paths and removing NeXus annotations.

Change: Generate a .nxs file based on the archive.
Files: src/nomad_measurements/xrd/schema.py, src/nomad_measurements/utils.py, src/nomad_measurements/xrd/nx.py
Details:
  • Implemented the _write_nx_file method in HDF5Handler to generate a NeXus file.
  • Added NEXUS_DATASET_MAP to connect NeXus file paths to the archive paths.
  • Modified HDF5Handler to populate NeXus datasets and attributes.
  • Added pynxtools as a dependency.


@sourcery-ai bot left a comment

Hey @ka-sarthak - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟡 General issues: 1 issue found
  • 🟢 Security: all looks good
  • 🟡 Testing: 1 issue found
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good


Review comments were left on the following files:
  • src/nomad_measurements/xrd/schema.py (resolved)
  • tests/test_xrd.py (resolved)
  • src/nomad_measurements/xrd/schema.py (resolved)
  • src/nomad_measurements/utils.py (outdated, resolved)
  • src/nomad_measurements/utils.py (outdated, resolved)
  • src/nomad_measurements/utils.py (resolved)
  • src/nomad_measurements/utils.py (outdated, resolved)
@ka-sarthak (Collaborator, Author)

@hampusnasstrom If an old entry is not reprocessed, opening it is not broken: the data in the HDF5Reference quantity still shows the array data. Here's a screenshot:
[screenshot of the entry's data section]

If I now make some changes and save the entry, it raises the error "Shape mismatch for " and the data section goes away. This isn't good, because it does not get fixed even if I reprocess the upload.

The safe way is to trigger the reprocess of the whole upload rather than doing it from inside the entry.

Labels: enhancement (New feature or request)
5 participants