Skip to content
This repository has been archived by the owner on May 27, 2024. It is now read-only.

fix: extensive loading harmonization config #80

Merged
merged 2 commits into from
Nov 18, 2022

Conversation

rkokkelk
Copy link
Contributor

When Event is init, it automatically loads the harmonization config file. This means that for every event uploaded, the entire harmonization config file is loaded and parsed. This dramatically extends the time needed to parse a single event.

This PR ensures that the harmonization file is only loaded once and then used for generating all Event objects.

Fixes: #79

When `Event` is init, it automatically loads the harmonization config
file. This means that for every event uploaded, the entire harmonization
config file is loaded and parsed. This dramatically extends the time
needed to parse a single event.

This PR ensures that the harmonization file is only loaded once and then
used for generating all `Event` objects.

Fixes: certat#79
Copy link
Contributor

@sebix sebix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for finding this performance bottleneck and providing the fix!

I suggest to load the harmonization directly from the file and not to extract it from the class

intelmq_webinput_csv/bin/backend.py Outdated Show resolved Hide resolved
Ensure that harmonization config file is loaded from file directly,
instead of `Event` object.
@rkokkelk
Copy link
Contributor Author

Is any further improvement on the efficiency of submitting the files appreciated. Mainly thinking about using Python concurrent features to parallelize the generating of the raw_message. If this is appreciated, I'll look into create a separate PR for it.

@sebix
Copy link
Contributor

sebix commented Oct 27, 2022

Keep in mind that there's a fork which has not all the features of this repo (e.g. in the fork the fieldnames are fixed, here arbitrary extra.* names can be set) but its backend is completely re-written.

@rkokkelk
Copy link
Contributor Author

I'm familiar with the fork. however, due to it being more JS heavy, we encountered more issues with it when handling very large files. We therefore shifted our focus on the original.

rkokkelk added a commit to rkokkelk/intelmq-webinput-csv that referenced this pull request Oct 29, 2022
Events were previously serialized serialy. However by using Python
concurrent features it is possible to execute event serialization over
multiple cores greatly decreasing overall execution time.

The actual sending of the serialized event is done without
parallelization due to threading lock issues in the Python Redis
implementation.

Relies on PR: certat#80
@sebix
Copy link
Contributor

sebix commented Nov 1, 2022

I'm familiar with the fork. however, due to it being more JS heavy, we encountered more issues with it when handling very large files.

Another reasons why it's "only" a fork. However it has CSRF protection and offers authentication (same as the manager), which is a must criterium in some organizations.

@aaronkaplan
Copy link

aaronkaplan commented Nov 9, 2022 via email

@rkokkelk
Copy link
Contributor Author

rkokkelk commented Nov 9, 2022

Okay awesome, could you then maybe look into the already open PR which will at least greatly increase the overall processing time for uploading events into IntelMQ.

rkokkelk added a commit to rkokkelk/intelmq-webinput-csv that referenced this pull request Nov 14, 2022
Events were previously serialized serialy. However by using Python
concurrent features it is possible to execute event serialization over
multiple cores greatly decreasing overall execution time.

The actual sending of the serialized event is done without
parallelization due to threading lock issues in the Python Redis
implementation.

Relies on PR: certat#80
@rkokkelk rkokkelk merged commit 30b4fe4 into certat:master Nov 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CSV Uploader performance issues
3 participants