NIFI-14336: Creating processor to list box folder contents #9784

Open
ncover21 wants to merge 3 commits into base: main

Conversation

Contributor

ncover21 commented Mar 7, 2025

Summary

NIFI-14336

A processor responsible for listing folder items for a Box Folder.

  • Different from ListBoxFile in that it can take an incoming FlowFile that carries a folder ID in an attribute and use that ID instead of a statically configured one. It also maintains no state.
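
To illustrate the attribute-driven folder ID described above, here is a minimal sketch in plain Java. It does not use the NiFi or Box APIs, and the attribute name `box.folder.id` and the class `FolderIdResolver` are hypothetical; the PR text does not say which attribute the processor actually reads.

```java
import java.util.Map;
import java.util.Optional;

public class FolderIdResolver {
    // Hypothetical attribute name, chosen only for this illustration.
    public static final String FOLDER_ID_ATTRIBUTE = "box.folder.id";

    // Returns the folder ID carried by the incoming FlowFile's attributes,
    // or an empty Optional when the attribute is missing or blank.
    public static Optional<String> resolveFolderId(Map<String, String> flowFileAttributes) {
        String value = flowFileAttributes.get(FOLDER_ID_ATTRIBUTE);
        if (value == null || value.isBlank()) {
            return Optional.empty();
        }
        return Optional.of(value.trim());
    }

    public static void main(String[] args) {
        // A FlowFile that carries a folder ID, and one that does not.
        System.out.println(resolveFolderId(Map.of(FOLDER_ID_ATTRIBUTE, "12345")));
        System.out.println(resolveFolderId(Map.of()));
    }
}
```

Because the ID comes from each FlowFile, the same processor instance can list a different folder per incoming FlowFile, which is the contrast with a single static setting.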

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, such as NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using mvn clean install -P contrib-check
    • JDK 21

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

Contributor

pvillard31 left a comment

Given the way this processor is likely going to be used, I think we should be prescriptive here and avoid generating one flow file per listed file in the configured folder. Instead I would generate a single FlowFile with JSON content with an array of records, each record containing the metadata information of the listed files.

Assuming a folder with 50k files, that will avoid generating 50k FlowFiles in one execution of the processor. With a single FlowFile, a user could then use a first SplitRecord processor configured to split every 1,000 records, then a second SplitRecord configured to split every 1 record, and finally a ForkRecord with path(s) for the fields that should be moved into FlowFile attributes. This way backpressure would do its job.

Thoughts?
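
The two-stage split suggested above can be modeled in plain Java to show the counts involved. This is only a sketch of the arithmetic, not NiFi code: `TwoStageSplit` and `split` are hypothetical names, and the list of integers stands in for the listed file records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class TwoStageSplit {
    // Partitions records into consecutive chunks of at most recordsPerSplit entries,
    // which is the effect a record-count-based split has on one input.
    public static <T> List<List<T>> split(List<T> records, int recordsPerSplit) {
        if (recordsPerSplit <= 0) {
            throw new IllegalArgumentException("recordsPerSplit must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < records.size(); i += recordsPerSplit) {
            int end = Math.min(i + recordsPerSplit, records.size());
            chunks.add(new ArrayList<>(records.subList(i, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Stand-in for 50k listed files arriving as one record-oriented input.
        List<Integer> listing = IntStream.range(0, 50_000).boxed().collect(Collectors.toList());

        // Stage 1: split into chunks of 1,000 records.
        List<List<Integer>> stageOne = split(listing, 1_000);
        System.out.println("Stage 1 outputs: " + stageOne.size());

        // Stage 2: split a single chunk into one record each.
        List<List<Integer>> stageTwo = split(stageOne.get(0), 1);
        System.out.println("Stage 2 outputs per chunk: " + stageTwo.size());
    }
}
```

The point of the intermediate stage is that the first split produces a small number of outputs, so the per-record fan-out happens one chunk at a time rather than all at once.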

Contributor Author

ncover21 commented Mar 7, 2025

Thanks for the review. Yes, I think that makes more sense for folders containing large numbers of files. I've adjusted the logic to use a record writer and write the listing as records instead. I've also added batch-based writing to handle large numbers of files.
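
One way batch-based writing like this might look, as a rough sketch in plain Java: items are buffered as they are iterated and handed to a sink every `batchSize` entries, with a final flush for the remainder. The names `BatchedListing` and `writeInBatches` are hypothetical, and the PR text does not describe how its own batching is implemented, so treat this as an assumption-laden illustration rather than the processor's logic.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.IntStream;

public class BatchedListing {
    // Drains the iterator, passing each full batch (and any final partial batch)
    // to the sink. Returns the number of batches handed to the sink.
    public static <T> int writeInBatches(Iterator<T> items, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> buffer = new ArrayList<>(batchSize);
        int batches = 0;
        while (items.hasNext()) {
            buffer.add(items.next());
            if (buffer.size() == batchSize) {
                sink.accept(new ArrayList<>(buffer)); // copy so the sink owns its batch
                buffer.clear();
                batches++;
            }
        }
        if (!buffer.isEmpty()) {
            sink.accept(new ArrayList<>(buffer));
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        // Stand-in for folder items streamed from a listing call.
        Iterator<Integer> items = IntStream.range(0, 2_500).boxed().iterator();
        int batches = writeInBatches(items, 1_000,
                batch -> System.out.println("writing batch of " + batch.size()));
        System.out.println("batches written: " + batches);
    }
}
```

Buffering at most one batch at a time keeps memory bounded regardless of how many items the folder contains.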
