
Handling ES, MongoDB and raw files. #157

Open
wants to merge 1 commit into write-nexus-section from ReprocessingNexusFile_2

Conversation

@RubelMozumder (Contributor) commented Jan 15, 2025

Duplicate of PR #156.

Summary by Sourcery

Enhancements:

  • Improved handling of existing files during file writing operations.


sourcery-ai bot commented Jan 15, 2025

Reviewer's Guide by Sourcery

This pull request implements handling of Elasticsearch, MongoDB, and raw files by adding a new function delete_entry_file and modifying the write_file method. The delete_entry_file function handles the deletion of entries and associated files from MongoDB, Elasticsearch, and the file system. The write_file method is updated to use this new function for file deletion and to handle NeXus files correctly.
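For orientation, here is a minimal sketch of what such a delete_entry_file function could look like, assembled from the diagrams and diff excerpts in this review; the import locations and the raw-file deletion step are assumptions, not taken from the PR.

from nomad.processing import Entry, Upload
from nomad.search import delete_by_query  # assumed import location


def delete_entry_file(archive, mainfile, delete_entry=True):
    """Remove the entry and raw file associated with ``mainfile`` from the upload."""
    entry = Entry.objects(
        upload_id=archive.m_context.upload_id, mainfile=mainfile
    ).first()
    if entry is None:
        return
    upload = Upload.objects(upload_id=archive.m_context.upload_id).first()
    if upload.published:
        raise PermissionError('Published entry cannot be deleted.')
    # Remove the entry's document from the Elasticsearch index.
    delete_by_query({'entry_id': entry.entry_id})
    # Remove the processing entry from MongoDB.
    if delete_entry:
        entry.delete()
    # Finally, remove the raw file itself from the upload's file system
    # (the step the PR refers to as delete_rawfiles); the exact call depends
    # on the NOMAD file-handling API and is left out of this sketch.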

Sequence diagram for file writing and deletion process

sequenceDiagram
    participant Client
    participant WriteFile
    participant DeleteEntry
    participant ES as ElasticSearch
    participant MongoDB
    participant FileSystem

    Client->>WriteFile: write_file()
    alt nexus file
        WriteFile->>DeleteEntry: delete_entry_file()
        DeleteEntry->>ES: delete_by_query()
        DeleteEntry->>MongoDB: delete entry
        DeleteEntry->>FileSystem: delete raw files
        WriteFile->>WriteFile: _write_nx_file()
        WriteFile->>FileSystem: process_updated_raw_file()
    else h5 file
        WriteFile->>WriteFile: _write_hdf5_file()
    end

Class diagram showing updated file handling structure

classDiagram
    class NXFileGenerationError{
        +Exception
    }
    class FileWriter{
        -archive
        -data_file
        -nexus: bool
        -hdf5_data_dict
        +write_file()
        -_write_nx_file()
        -_write_hdf5_file()
    }
    class DeleteEntryFile{
        +delete_entry_file(archive, mainfile, delete_entry)
    }

    FileWriter ..> DeleteEntryFile: uses
    FileWriter --> NXFileGenerationError: throws

File-Level Changes

Change: Implement file deletion and entry removal logic.
Files: src/nomad_measurements/utils.py
Details:
  • Added the delete_entry_file function to delete entries and files from MongoDB, Elasticsearch, and the file system.
  • Modified the write_file method to use the delete_entry_file function for deleting files.
  • Updated error handling in write_file to correctly handle file deletion when NeXus file generation fails.
  • Added logic to process updated raw files when a NeXus file is written.
  • Removed direct file deletion and entry processing logic from the _write_nx_file and _write_hdf5_file methods.
  • Added handling for deleting published entries in the delete_entry_file function.
  • Added import statements for the classes required by the delete_entry_file function.

Change: Refactor file writing logic.
Files: src/nomad_measurements/utils.py
Details:
  • Removed debugging code from utils.py.
  • Removed an unused import statement in the _write_nx_file method.
  • Simplified the file writing logic by removing redundant checks and operations.
  • Improved code readability by reorganizing and streamlining the file writing process.


@sourcery-ai bot left a comment

Hey @RubelMozumder - I've reviewed your changes - here's some feedback:

Overall Comments:

  • Remove the commented-out debug code at the top of the file.
  • The delete_entry_file function is missing a docstring. Please add documentation explaining the purpose, parameters, and behavior of this function, especially since it handles critical file deletion operations; a possible shape is sketched below.
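A possible shape for such a docstring, using the parameters shown in the class diagram above (hypothetical wording, to be checked against the actual implementation):

def delete_entry_file(archive, mainfile, delete_entry=True):
    """
    Delete the entry and raw file associated with ``mainfile``.

    Removes the entry's document from Elasticsearch, the entry itself from
    MongoDB (when ``delete_entry`` is True), and the raw file from the
    upload's file system. Raises ``PermissionError`` if the upload is
    already published.

    Args:
        archive: Archive whose ``m_context`` identifies the upload.
        mainfile: Path of the raw file, relative to the upload, to remove.
        delete_entry: Whether to also remove the MongoDB entry.
    """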
Here's what I looked at during the review
  • 🟡 General issues: 1 issue found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good


).first()):
    upload = Upload.objects(upload_id=archive.m_context.upload_id).first()
    if upload.published:
        raise PermissionError('Published entry cannot be deleted.')

issue (bug_risk): Handle PermissionError when attempting to modify published entries

The PermissionError thrown when attempting to modify published entries isn't caught in write_file(). This could cause unexpected failures in production.
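As an illustration only, a minimal sketch of how write_file could catch this error (attribute and method names follow the diagrams and the refactoring suggestion below, and self.logger is assumed to exist; this is not the PR's actual implementation):

def write_file(self):
    if self.nexus:
        try:
            delete_entry_file(archive=self.archive, mainfile=self.data_file)
            self._write_nx_file()
            self.archive.m_context.process_updated_raw_file(
                self.data_file, allow_modify=True
            )
        except PermissionError:
            # Published entries must not be modified; fall back without deleting.
            self.logger.warning('Entry is published; writing HDF5 instead of NeXus.')
            self.nexus = False
            self._write_hdf5_file()
        except Exception:
            self.logger.warning('NeXus file generation failed; falling back to HDF5.')
            self.nexus = False
            self._write_hdf5_file()
    else:
        self._write_hdf5_file()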

@@ -52,6 +52,8 @@
)


# import debugpy

issue (complexity): Consider simplifying file handling by separating entry cleanup and improving error handling logic.

The file handling could be simplified while maintaining all functionality by:

  1. Separating entry cleanup from file operations:
def cleanup_entry(archive, mainfile):
    """Handle entry cleanup in the database."""
    from nomad.processing import Entry, Upload
    from nomad.search import delete_by_query  # assumed import location

    entry = Entry.objects(
        upload_id=archive.m_context.upload_id, mainfile=mainfile
    ).first()
    if entry:
        if Upload.objects(upload_id=archive.m_context.upload_id).first().published:
            raise PermissionError('Published entry cannot be deleted.')
        # Remove the search index document, then the MongoDB entry.
        delete_by_query({'entry_id': entry.entry_id})
        entry.delete()

def write_file(self):
    """Simplified file writing with clear error handling."""
    try:
        if self.nexus:
            # Clean slate: remove the existing raw file before regenerating it.
            self.archive.m_context.delete_raw_file(self.data_file)
            self._write_nx_file()
            self.archive.m_context.process_updated_raw_file(
                self.data_file, allow_modify=True
            )
        else:
            self._write_hdf5_file()
    except Exception:
        self.logger.warning('NeXus file creation failed, falling back to HDF5.')
        self.nexus = False
        cleanup_entry(self.archive, self.data_file)
        self._write_hdf5_file()

This refactoring:

  • Separates concerns between file operations and entry management
  • Simplifies the error handling flow
  • Makes the fallback process more explicit
  • Reduces code duplication
  • Maintains all existing functionality and safeguards

    upload_id=archive.m_context.upload_id, mainfile=mainfile
).first()):
    upload = Upload.objects(upload_id=archive.m_context.upload_id).first()
    if upload.published:
Collaborator

I wonder at which point this condition will be met. If the upload is published, I think the user won't be able to run the normalization anymore.

Contributor Author

I assume the same: since we remove the raw file and delete the entry, this condition should not normally be met, unless something goes wrong.

        self._write_hdf5_file()
else:
    self._write_hdf5_file()

if self.data_file.endswith('.nxs'):
Collaborator

Maybe we can do this right after self._write_nx_file() in line 377

Contributor Author

If _write_nx_file fails, there is no need to run process_updated_raw_file; the code stops and falls back to writing the HDF5 file.

@ka-sarthak (Collaborator) left a comment

I still have my doubts about deleting the raw file and entry before creating a new one. If the re-creation fails for some reason, anything relying on the file will break. But in the current situation, with resource contention issues, this seems to be the only approach that works.

We should definitely revisit this and raise the issue in general.

Also, for future reference, can you explain the delete_entry_file function (as PR comments or in the doc string perhaps)? For example, what things are being deleted by these: delete_by_query, entry.delete, delete_rawfiles.

@ka-sarthak force-pushed the ReprocessingNexusFile_2 branch from cf5b34e to 2d2709d on January 15, 2025 15:30
@@ -371,28 +373,32 @@ def write_file(self):
    """
    if self.nexus:
        try:
            delete_entry_file(archive=self.archive, mainfile=self.data_file)

How can you ensure the caller is not appending new data to existing data?
How can you ensure the caller always provides a complete file?
When the existing file is deleted, how can you tell what contents are valid and may need to be retained?
