Adding many nodes (~10000) to a group is extremely slow (sqlalchemy backend) #1319
Yes, it looks slower than Django (by an order of magnitude).
I tested it with the following script:
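A minimal sketch of this kind of timing test (a reconstruction, not the original script; aiida 0.x API from memory):

```python
# Sketch of the timing test (not the original script; aiida 0.x API from
# memory - Group.get_or_create and Group.add_nodes exist there, but the
# exact import paths may differ between 0.x versions).
import time

from aiida.orm import Group, Node

group, _ = Group.get_or_create(name='test2')

# Store 10000 plain nodes (the real test used CifData nodes).
nodes = [Node().store() for _ in range(10000)]

# Add them to the group in batches of 100, timing each batch.
for i in range(0, len(nodes), 100):
    start = time.time()
    group.add_nodes(nodes[i:i + 100])
    print('nodes %5d-%5d added in %.2f s' % (i, i + 100, time.time() - start))
```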
I wonder why the time per 100 nodes increases for sqlalchemy but stays the same for django. Might it be related to issue #1289?
With a bit better management of the session we get the following, e.g.:
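One guess at what "better management of the session" means here: keep `expire_on_commit=False` and commit once per batch rather than once per node. A self-contained sketch against a toy schema (illustrative table/column names, not aiida's actual models):

```python
# Toy schema mimicking the group-node association (illustrative names).
from sqlalchemy import Column, ForeignKey, Integer, Table, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import Session, relationship

Base = declarative_base()

group_nodes = Table(
    'db_dbgroup_dbnodes', Base.metadata,
    Column('dbgroup_id', Integer, ForeignKey('db_dbgroup.id')),
    Column('dbnode_id', Integer, ForeignKey('db_dbnode.id')),
)

class DbNode(Base):
    __tablename__ = 'db_dbnode'
    id = Column(Integer, primary_key=True)

class DbGroup(Base):
    __tablename__ = 'db_dbgroup'
    id = Column(Integer, primary_key=True)
    dbnodes = relationship('DbNode', secondary=group_nodes)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = Session(bind=engine, expire_on_commit=False)

group = DbGroup()
nodes = [DbNode() for _ in range(10000)]
session.add(group)
session.add_all(nodes)
session.commit()

# The "better session management": one commit per batch of 100,
# instead of flushing/committing for every single node.
for i in range(0, len(nodes), 100):
    group.dbnodes.extend(nodes[i:i + 100])
    session.commit()
```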
But it needs a bit of thinking. If Leo/Seb don't find a solution, I will have a look at it as soon as I come back.
@yakutovicha
One source of the problem might be missing indexes in the SQLAlchemy schema:
Notice the difference with the django schema:
Note that the django schema has indexes on these columns.
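In SQLAlchemy terms the suggestion amounts to indexing the foreign-key columns of the association table; a hypothetical sketch on the same toy schema as above, not the actual aiida migration:

```python
from sqlalchemy import Column, ForeignKey, Integer, MetaData, Table

metadata = MetaData()
group_nodes_indexed = Table(
    'db_dbgroup_dbnodes', metadata,
    # index=True on both FK columns - what Django's m2m through-table
    # gets by default, and what the SQLA schema was missing.
    Column('dbgroup_id', Integer, ForeignKey('db_dbgroup.id'), index=True),
    Column('dbnode_id', Integer, ForeignKey('db_dbnode.id'), index=True),
)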
I added the indexes and I don't see much difference in time.
I am looking into this and here are some comments (what I write below is with expire_on_commit=False): (b) What we do now is inefficient, since we add the objects to the relationship one by one.
We could do something like the following:
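Presumably something along these lines (a fragment continuing the toy-schema sketch above): one bulk extend of the relationship collection instead of N separate appends.

```python
# Instead of:
#     for node in nodes:
#         group.dbnodes.append(node)
# do one bulk mutation of the relationship collection:
group.dbnodes.extend(nodes)
session.commit()
```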
We can bypass the ORM by directly executing an SQL statement to update the relationship, as follows:
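Probably something of this shape (again on the toy schema above): a Core-level executemany INSERT on the association table, so no ORM relationship objects are touched.

```python
# Direct INSERT on the association table, bypassing the ORM unit of work.
session.execute(
    group_nodes.insert(),
    [{'dbgroup_id': group.id, 'dbnode_id': n.id} for n in nodes],
)
session.commit()
```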
but the performance gains are not amazing.
What is amazing is the SQL insert statement with a session.expunge_all() at the beginning - it's faster than Django, but we have to do it properly in order not to mess up everything.
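A sketch of that variant (toy schema again); the "mess up everything" caveat is that after expunge_all() the in-memory ORM objects are detached and no longer track the database:

```python
# Capture plain ids first - after expunge_all() the ORM instances are
# detached from the session and should not be relied upon.
gid = group.id
node_ids = [n.id for n in nodes]

session.expunge_all()  # empty the identity map / detach everything
session.execute(
    group_nodes.insert(),
    [{'dbgroup_id': gid, 'dbnode_id': nid} for nid in node_ids],
)
session.commit()
```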
I think that I found a nice solution, inspired by our discussions, which I am going to implement if there is no objection (@giovannipizzi, @ltalirz - we can also talk if needed). I will keep an optimized version of the ORM-based implementation (still slow when compared to Django), and I will also have a PostgreSQL-specific SQL implementation with an SQL insert statement that will not use session.expunge_all() and will leave the session objects up-to-date (so after the operation, the ORM group objects will have the right members etc.). It is also not affected by the expire_on_commit option. The SQL version will be faster than the existing Django implementation (half the time or even lower). We would be able to choose the desired SQLA group-add code using a flag. The SQLA with SQL insert:
The Django version:
@szoupanos Thanks a lot for your work on this, it really matters to us! As hoped, performance-wise this is an improvement of several orders of magnitude - your numbers on adding nodes in batches of 100 with sqla are not very consistent (they vary by one order of magnitude), but they are really from a different world than before. Happy to talk to you on Monday to get a better understanding of how it works, but from our side this is great!
@szoupanos Can you please have a look at this: https://github.com/muhrin/aiida_core/tree/fix_1319_group_adding_slow
Description:
NOTE: Before this gets considered for a merge we have to find a way to …
BEFORE:
AFTER:
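A guess at what the "subsession" approach on that branch might look like (not Martin's actual code), sketched on the toy schema above: wrap the additions in a nested transaction (SAVEPOINT) so the per-node work is not committed individually.

```python
# Hypothetical "subsession": a SAVEPOINT around the whole addition, so the
# per-node appends are flushed inside one nested transaction.
with session.begin_nested():
    for node in nodes:
        group.dbnodes.append(node)
session.commit()
```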
Thanks a lot Martin for this, but in my tests I don't see the big improvements that you see with your changes, and the SQLA code remains 2 orders of magnitude slower than the Django code (on group addition). Some numbers from the tests on my VM: SQLA - original code:
SQLA with subsession:
Django:
@szoupanos Depending on the amount of work left, we need to decide if we keep it for 1.0 or 1.1.
There has been extensive research on this regarding the best solution. bulk_add_to_group_pure_sql_one_line.py (which can easily be adapted to avoid inserting duplicates) is the fastest; it issues a statement like the following:
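Something of this shape (my reconstruction, with table and column names assumed from the aiida 0.x schema): a single multi-row INSERT covering the whole batch, here issued through the session.

```python
from sqlalchemy import text

# group_id and node_ids are plain integers (assumed inputs).
values = ', '.join('(%d, %d)' % (group_id, nid) for nid in node_ids)
session.execute(text(
    'INSERT INTO db_dbgroup_dbnodes (dbgroup_id, dbnode_id) VALUES ' + values
))
session.commit()
```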
However, I understand that we would rather use, even if only partially, the SQLA interface (even if we don't use the ORM). So we will proceed with the solution of bulk_add_to_group_sql_models_one_line_dupl.py.
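A sketch of that direction (not the script itself): build the INSERT through the SQLA table object rather than a raw string, and skip ids that are already members (reusing the toy group_nodes table from the earlier sketches).

```python
from sqlalchemy import select

# Ids already in the group (SQLA 1.x select() style).
existing = {
    row[0] for row in session.execute(
        select([group_nodes.c.dbnode_id])
        .where(group_nodes.c.dbgroup_id == group_id)
    )
}

# Insert only the missing memberships, in a single executemany call.
new_rows = [
    {'dbgroup_id': group_id, 'dbnode_id': nid}
    for nid in node_ids
    if nid not in existing
]
if new_rows:
    session.execute(group_nodes.insert(), new_rows)
session.commit()
```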
… the group addition procedure and import method of SQLA should benefit from it (#2471)
Fast group addition, skipping the ORM
This PR solves issue #1319 and provides a fast way to add nodes to a group that circumvents the ORM. Import/export benefits from this, particularly import under SQLA. To achieve this, it uses raw SQL statements. It was agreed by @giovannipizzi, @sphuber and @szoupanos that this is acceptable for aiida 0.X; sqlalchemy >= 1.1 provides better solutions for this in aiida 1.0 (see comments in the code). Furthermore, it adds a new migration `7a6587e16f4c`. For consistency, PR #2514 inserts this migration also in the upstream branch for aiida 1.0 (and handles collisions with other subsequent migrations).
Fixed in PR #2518.
Trying to append many nodes to a group, I noticed that it takes an extremely long time.
What I am doing is the following:
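Roughly this (a sketch of the reported workflow; aiida 0.x API from memory):

```python
# Create the group and add every stored CifData node to it
# (aiida 0.x import paths; they may differ slightly between versions).
from aiida.orm import Group
from aiida.orm.data.cif import CifData
from aiida.orm.querybuilder import QueryBuilder

group, _ = Group.get_or_create(name='test2')

qb = QueryBuilder()
qb.append(CifData)
nodes = [entry[0] for entry in qb.all()]

group.add_nodes(nodes)  # this single call takes an extremely long time
```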
The reason I am doing this is that I want to export the whole database using the following command:
verdi export create --group_names test2 test2.export
To reproduce the issue, I attach a script that appends 10000 CifData nodes to the AiiDA database:
upload_10000_cifs.tar.gz