Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handeling ES, mongo DB and raw files. #156

Open
wants to merge 19 commits into
base: write-nexus-section
Choose a base branch
from

Conversation

RubelMozumder
Copy link
Contributor

@RubelMozumder RubelMozumder commented Jan 10, 2025

@ka-sarthak, I think it would fix deleting a nomad entry from Mongo, ES, and staging file systems. Please, take a look into it.

aalbino2 and others added 19 commits December 19, 2024 16:57
Co-authored-by: Sarthak Kapoor <57119427+ka-sarthak@users.noreply.github.com>
Co-authored-by: Sarthak Kapoor <57119427+ka-sarthak@users.noreply.github.com>
Co-authored-by: Hampus Näsström <hampus.nasstrom@gmail.com>
* Implement write nexus section based on the populated nomad archive

* app def missing.

* mapping nomad_measurement.

* All concept are connected, creates nexus file and subsection.

* adding links in hdf5 file.

* Remove the nxs file.

* back to the previous design.

* Include pynxtools plugins in nomad.yaml and extend dependencies including pynxtools ans pnxtools-xrd.

* PR review correction.

* Remove the entry_type overwtitten.

* Remove comments.

* Replace __str__ function.

* RUFF

* Update pyproject.toml

Co-authored-by: Sarthak Kapoor <57119427+ka-sarthak@users.noreply.github.com>

* Update src/nomad_measurements/xrd/schema.py

Co-authored-by: Sarthak Kapoor <57119427+ka-sarthak@users.noreply.github.com>

* Update src/nomad_measurements/xrd/nx.py

* Replace Try-block.

---------

Co-authored-by: Sarthak Kapoor <sarthakkapoor@hotmail.com>
Co-authored-by: Sarthak Kapoor <57119427+ka-sarthak@users.noreply.github.com>
* updated plugin structure

* added pynxtools dependency

* Apply suggestions from code review

Co-authored-by: Sarthak Kapoor <57119427+ka-sarthak@users.noreply.github.com>
Co-authored-by: Hampus Näsström <hampus.nasstrom@gmail.com>

* Add sections for RSM and 1D which uses HDF5 references

* Abstract out data interaction using setter and getter; allows to use same methods for classes with hdf5 refs

* Use arrays, not references, in the `archive.results` section

* Lock the state for using nexus file and corresponding references

* Populate results without references

* Make a general reader for raw files

* Remove nexus flags

* Add quantity for auxialiary file

* Fix rebase

* Make integration_time as hdf5reference

* Reset results (refactor)

* Add backward compatibility

* Refactor reader

* add missing imports

* AttrDict class

* Make concept map global

* Add function to remove nexus annotations in concept map

* Move try block inside walk_through_object

* Fix imports

* Add methods for generating hdf5 file

* Rename auxiliary file

* Expect aux file to be .nxs in the beginning

* Add attributes for hdf5: data_dict, dataset_paths

* Method for adding a quantity to hdf5_data_dict

* Abstract out methods for creating files based on hdf5_data_dict

* Add dataset_paths for nexus

* Some reverting back

* Minor fixes

* Refactor populate_hdf5_data_dict: store a reference to be made later

* Handle shift from nxs to hdf5

* Set hdf5 references after aux file is created

* Cleaning

* Fixing

* Redefine result sections instead of extending

* Remove plotly plots from ELN

* Read util for hdf5 ref

* Fixing

* Move hdf5 handling into a util class

* Refactor instance variables

* Reset data dicts and reference after each writing

* Fixing

* Overwrite dataset if it already exists

* Refactor add_dataset

* Reorganize and doctrings

* Rename variable

* Add read_dataset method

* Cleaning

* Adapting schema with hdf5 handler

* Cooments, minor refactoring

* Fixing; add `hdf5_handler` as an attribute for archive

* Reorganization

* Fixing

* Refactoring

* Cleaning

* Try block for using hdf5 handler: dont fail early, as later normalization steps will have the handler!

* Extract units from dataset attrs when reading

* Fixing

* Linting

* Make archive_path optional in add_dataset

* Rename class

* attrs for add_dataset; use it for units

* Add add_attribute method

* Refactor add_attribute

* Add plot attributes: 1D

* Refactor hdf5 states

* Add back plotly figures

* rename auxiliary file name if changed by handler

* Add referenced plots

* Allow hard link using internel reference

* Add sections for plots

* Comment out validation

* Add archive paths for the plot subsections

* Add back validation with flag

* Use nexus flag

* Add interpolated intensity data into h5 for qspace plots

* Use prefix to reduce len of string

* Store regularized linespace of q vectors; revise descriptions

* Remove plotly plots

* Bring plots to overview

* Fix tests

* Linting; remove attr arg from add_dataset

* Review: move none check into method

* Review: use 'with' for opening h5 file

* Review: make internal states as private vars

* Add pydantic basemodel for dataset

* Use data from variables if available for reading

* Review: remove lazy arg

* Move DatasetModel outside Handler class

* Remove None from get, as it is already a default

* Merge if conditions

---------

Co-authored-by: Andrea Albino <andrea.albino@physik.hu-berlin.de>
Co-authored-by: Andrea Albino <95371554+aalbino2@users.noreply.github.com>
Co-authored-by: Hampus Näsström <hampus.nasstrom@gmail.com>
* Remove the Nexus file before regenerating it.


* Reference to the NeXus entry.

* PR review comments.
Copy link

sourcery-ai bot commented Jan 10, 2025

Reviewer's Guide by Sourcery

This pull request refactors the file writing process to handle Elasticsearch, MongoDB, and raw files more efficiently. It introduces a new function delete_entry_file to manage file deletion and updates the write_file function to use this new function. The changes also improve error handling and logging, and simplify the logic for writing different file types.

Sequence diagram for updated file writing process

sequenceDiagram
    participant Client
    participant WriteFile
    participant DeleteEntry
    participant Storage
    participant ES
    participant MongoDB

    Client->>WriteFile: write_file()
    alt nexus file
        WriteFile->>DeleteEntry: delete_entry_file()
        DeleteEntry->>Storage: Check if file exists
        DeleteEntry->>Storage: Delete raw files
        DeleteEntry->>MongoDB: Delete entry
        DeleteEntry->>ES: Delete by query
        WriteFile->>Storage: Write NeXus file
        WriteFile->>Storage: Process updated raw file
    else hdf5 file
        WriteFile->>Storage: Write HDF5 file
    end
Loading

Class diagram for file handling components

classDiagram
    class WriteFile {
        +write_file()
        -_write_nx_file()
        -_write_hdf5_file()
    }

    class DeleteEntryFile {
        +delete_entry_file(archive, mainfile, delete_entry)
    }

    class Upload {
        +upload_id
        +published
    }

    class Entry {
        +entry_id
        +upload_id
        +mainfile
    }

    class StagingUploadFiles {
        +delete_rawfiles(path)
    }

    WriteFile ..> DeleteEntryFile : uses
    DeleteEntryFile ..> Upload : checks
    DeleteEntryFile ..> Entry : manages
    DeleteEntryFile ..> StagingUploadFiles : uses
Loading

File-Level Changes

Change Details Files
Introduction of a new delete_entry_file function.
  • A new function delete_entry_file is introduced to handle the deletion of entry files and associated data in Elasticsearch and MongoDB.
  • The function takes the archive, mainfile, and an optional delete_entry flag as arguments.
  • It checks if the file exists and if the upload is published before proceeding with deletion.
  • It uses StagingUploadFiles to delete raw files and delete_by_query to delete associated entries in Elasticsearch.
  • If the delete_entry flag is set, it also deletes the entry from MongoDB.
src/nomad_measurements/utils.py
Refactoring of the write_file function.
  • The write_file function is updated to use the new delete_entry_file function.
  • The logic for handling NeXus and HDF5 file writing is simplified.
  • Error handling is improved by calling delete_entry_file with appropriate arguments in case of exceptions.
  • The process_updated_raw_file function is called after writing the NeXus file.
  • Redundant code for deleting files is removed.
src/nomad_measurements/utils.py
Removal of unused imports and code.
  • The unused import of Entry in the _write_nx_file function is removed.
  • Redundant code for checking and deleting existing files before writing new ones is removed from _write_nx_file.
src/nomad_measurements/utils.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time. You can also use
    this command to specify where the summary should be inserted.

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @RubelMozumder - I've reviewed your changes - here's some feedback:

Overall Comments:

  • The new delete_entry_file() function needs proper documentation - please add a docstring describing its purpose, parameters, and possible exceptions (especially the PermissionError for published entries).
Here's what I looked at during the review
  • 🟡 General issues: 1 issue found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@@ -569,6 +564,29 @@ def _set_hdf5_reference(
resolved_section.m_set(quantity_name, ref)


def delete_entry_file(archive, mainfile, delete_entry=False):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: Consider handling the case where Entry.first() returns None to avoid potential AttributeError

@RubelMozumder RubelMozumder requested review from TLCFEM and removed request for TLCFEM January 12, 2025 17:57
Copy link

@TLCFEM TLCFEM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this whole thing relies on deleting things to make things work.
This often indicates other parts are wrongly implemented.
This kind of functionality should be discouraged and shall not be implemented in the first place, because it encourages potential risky and faulty code.

You have been warned.


stagingUploadFiles = StagingUploadFiles(archive.m_context.upload_id)
stagingUploadFiles.delete_rawfiles(path=mainfile)
entry = Entry.objects(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to pay the cost if not used.
So changes like this is better.

        if delete_entry and (entry := Entry.objects(
                upload_id=archive.m_context.upload_id, mainfile=mainfile
            ).first()):
                delete_by_query({'entry_id': entry.entry_id})
                entry.delete()

@@ -569,6 +564,29 @@ def _set_hdf5_reference(
resolved_section.m_set(quantity_name, ref)


def delete_entry_file(archive, mainfile, delete_entry=False):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first input argument can directly be upload_id.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer explicit key-value argument for optional arguments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants