-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Solution for the fast node addition to group for v0.12 #2471
Solution for the fast node addition to group for v0.12 #2471
Conversation
Also, @sphuber let me know how you think that the flag (to skip the ORM) for the group node addition should be passed to the Backend layer at the provenance redesign. E.g. Maybe we could make it only available to the backend and when someone wants to use it, should use this interface? We can discuss in person when you have time. |
I think it's ok to use raw SQL for 0.12.3, but we should use the commented version for 1.0. |
@szoupanos
In conclusion: I'm very happy with the improvement concerning adding nodes to groups. |
@ltalirz Does your last bullet point mean that the biggest part of the import is spent into adding the already imported nodes to a group? Does the |
I think that's for @szoupanos to answer... |
I didn't touch the export and it is a common function for both backends (it's worth to see if there are any performance differences between the backends and if there are, if they are significant - but this is a different discussion). The import of SQLA uses (or at least it should use the new code). Let me have a close look... |
I missed to pass the flag to one of the group.add_nodes. The current timings for the import of the given export file: For Django:
For SQLA:
|
That would be great - please push the fix and I will confirm |
c45289c
to
5449949
Compare
If we are all happy for 0.12 with this fix (RAW SQL). I would like to clean a bit the group.add_nodes code. And for the export, it is as I expected (I exported the imported group for every backend): SQLA
Django
|
I think we are.
As mentioned before, I think this is acceptable for the moment, and there are other, more important problems in sqla [1]. If the export does not involve adding nodes to groups (I guess it doesn't), we should just leave it as it is. [1] E.g. the following code still scales super-linearly with the number of nodes (I think this goes back to the from aiida.orm.data.parameter import ParameterData
N = 10000
node_list = [ParameterData(dict={}).store() for i in range(N)] But this can be fixed in a separate PR. |
Also, I would leave, commented, the code using the SQLA ORM, with a comment note saying why we are using raw SQL here (i.e. the version of SQLA is too old and this is fixed in 1.0) |
OK, I will do the necessary cleaning and I will update the PR. @ltalirz Yes, [1] should be related to the expire_on_commit. I think that it is worth seeing the performance of [1] in provenance_redesign with expire_on_commit=True/False and vs Django and then depending on the available time, developments (Django JSONB) and importance of [1], re-discuss. |
… the group addition procedure and import method of SQLA should benefit from it
5449949
to
14c135b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tested it and confirm that the verdi import problem is resolved - yay!
Good to merge for me.
@sphuber @giovannipizzi any last comments?
Just going back to the new migration here - I was wondering, if we insert only here, since it's not in 1.0.0, I have the feeling that it means that a person going to 0.12.3 cannot then migrate to 1.0? (also, I think here we are not putting all the logic that we have in 1.0 to deduplicate UUID, if this happens). What would be the timing without the migration, would this be worse? |
Yeah I would be very careful with this. We might be breaking the migration for people who want to go from 0.12* to |
The migration adds the constraint to the database, a constraint that it is used by the query to avoid duplicate check at the python level. This constraint is also added to the models. So if we keep it at the models and not in the database, we will have inconsistencies and tests would fail (there are tests in SQLA that check that). If I am not mistaken, Martin had added this constraint long time ago "somewhere" and maybe it is already at 1.0.* |
Yes it is in |
So I did some tests related to the discussion that we had and as far the unique constraint is named differently in the two migartions, everything works fine - but, of course you end up with 2 constraints with the same name. E.g.
Revision history
As soon as you use the same name, then you have collisions during the migration, and Alembic complains that there is already a constraint with the same name). So what I would propose is to silently add the current migration (7a6587e16f4c) to develop and provenance_redesign at the right place (migration order) and then modify the migration 59edaf8a8b79 to also drop the old constraint. We can also add a check to 59edaf8a8b79, if possible, to verify that the old constraint is there before dropping it. This will cover the users that use develop, stopped at a version just before 59edaf8a8b79 and try to apply 59edaf8a8b79 without having applied 7a6587e16f4c - this is a very rare case scenario but OK... Let me know |
So you would insert |
I would insert |
|
I checked alembic migrations and even if they support merging, my understanding is that it will not cover our case
We can have multiple branches but at the mergepoint all the migrations (from both branches) are applied. Let me know if I missed something. I will proceed with the solution that we discussed in person the previous days. |
The complementary PR for provenance_redesign/develop is the #2514 |
Is it ready to merge? |
@ltalirz @giovannipizzi @sphuber |
@ltalirz |
I merged PR #2514 for the backport of the migration included in this PR, so whenever this is ready, I guess it is good to be merged |
This PR solves issue #1319 and provides a fast and non-ORM way to add nodes to a group. Import/export and more specifically import under SQLA should benefit from this.
@ltalirz Could you, please, verify that it works as expected at one of your databases (the corresponding tests pass)?
@giovannipizzi We should decide which version we will use at aiida/orm/implementation/sqlalchemy/group.py
I currently use the RAW SQLA query which is the fastest. I commented out the version that uses SQLA to construct the query since it needs SQLA 1.1+ and this is not supported at that version of AiiDA. Let's decide if it is worth it to make AiiDA compatible with SQLA 1.1+ or to keep the raw SQL version for 0.12
If we keep the raw SQL approach, maybe, we should do the group addition in batches and launch several SQL queries - I am not sure if I can add 1M of node ids in one INSERT SQL.