-
-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Improve parser performance #318
Conversation
1e7f120
to
94bfa7d
Compare
I believe that the tests are not passing because the parent order written to the exported taxonomy file is not deterministic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really good work.
I have some proposal though.
470b04f
to
99f1d5f
Compare
99f1d5f
to
f1e936d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great !
* refactor: mark private function with _ * refactor(parser): add type annotations and clean up code * chore: use context manager to close session in tests * chore: update neo4j and Makefile * refactor: create parser specific directory * refactor: start taxonomy_parser by copying parser file * refactor: move logger to separate file * refactor: remove unnecessary code for taxonomy parser * feat: update TaxonomyParser to return taxonomy class * feat: update parser to use taxonomy parser * chore: update tests for new taxonomy parser * fix: remove multi_label for single project_label * feat: improve node creation performance * feat: add node id index to improve search query performance * feat: improve previous link creation performance * feat: improve child link creation performance * feat: group queries into transaction * chore: update logging info and add timing info * fix: add db name to sessions * refactor: move ellipsis func to logger class * fix: stop id index creation if index exists * fix: resolve comments * fix: resolve comments
What
The goal of this PR is to improve the performance of the writing of the taxonomy data to the
Neo4J
database, so that every taxonomy is parsable in a raisonnable amount of timeNote
This PR is quite big, so to simplify the review, you can go through commit by commit (I tried to make the commits quite small and atomic)
Changes
We can have small performance increases for small taxonomies, but the changes mainly target medium or large taxonomies, by:
Fixes bug(s)
#289
#183
Part of
Follow up of this parser/db writer decoupling PR
Results
We have an overall speedup of x11 (mainly 730.6s -> 25.44s for the
categories
taxonomy). Before, the writes were very slow, so HTTP timeouts were possible when importing a taxonomy file, but now all taxonomies are parsable in less than 30s.Details
Before:
Additives -> 34.93s
Allergens -> 3.31s
Amino Acids -> 1.6s
Categories -> 730.6s
Countries -> 103.49s
Data Quality -> 16.08s
Food Groups -> 1.63s
Improvements -> 1.06s
Ingredients -> 255.13s
Ingredients Analysis -> 1.12s
Ingredients Processing -> 5.94s
Labels -> 56.08s
Languages -> 32.26s
Minerals -> 6.58s
Misc -> 0.64s
Nova Groups -> 0.61s
Nucleotides -> 1.64s
Nutrients -> 10.79s
Origins -> 12.40s
Other Nutritional Substances -> 2.84s
Packaging Materials -> 6.90s
Packaging Recycling -> 0.94s
Packaging Shapes -> 3.81s
Periods After Opening -> 1.34s
Preservation -> 0.57s
States -> 2.19s
Test -> 1.41s
Vitamins -> 2.94s
(Total: 1299s)
After:
All taxonomies are parsable in less than 30s
Additives -> 4.78s
Allergens -> 1.23s
Amino Acids -> 0.77s
Categories -> 25.44s
Countries -> 15.71s
Data Quality -> 1.09s
Food Groups -> 0.96s
Improvements -> 1.41s
Ingredients -> 13.76s
Ingredients Analysis -> 1.25s
Ingredients Processing -> 2.28s
Labels -> 3.95s
Languages -> 14.26s
Minerals -> 4.0s
Misc -> 0.32s
Nova Groups -> 0.71s
Nucleotides -> 1.95s
Nutrients -> 5.87s
Origins -> 5.41s
Other Nutritional Substances -> 1.5s
Packaging Materials -> 2.4s
Packaging Recycling -> 0.66s
Packaging Shapes -> 2.38s
Periods After Opening -> 0.63s
Preservation -> 0.57s
States -> 0.83s
Test -> 0.82s
Vitamins -> 2.40s
(Total: 117s)