Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

limit batch size for bulk_create operations #3713

Merged
merged 2 commits into from
Feb 12, 2020

Conversation

ltalirz
Copy link
Member

@ltalirz ltalirz commented Jan 15, 2020

partially addresses #3712

postgresql has a "MaxAllocSize" that defaults to 1 GB [1].
If you try to insert more than that in one go (e.g. during import of a
large AiiDA export file), you encounter the error

psycopg2.errors.ProgramLimitExceeded: out of memory
DETAIL:  Cannot enlarge string buffer containing 0 bytes by 1257443654 more bytes.

This commit avoids this issue (for the django backend) by setting a
batch size for bulk_create operations.

[1] https://github.com/postgres/postgres/blob/master/src/include/utils/memutils.h#L40

Open questions:

  • What should the batch size be? I guess it is in units of the django objects you are passing (see e..g here ), so the current value of 100k would allow each object to generate up to an average of 10 kB of strings per object (it did fix the issue for me)
  • Didn't look into sqla side, pointers welcome let's fix this in another PR
  • From the tests it seems the BATCH_SIZE variable should be moved somewhere else than the settings.py. Happy to move it wherever you guys think it should go.

@ltalirz ltalirz requested a review from giovannipizzi January 15, 2020 18:31
@ltalirz
Copy link
Member Author

ltalirz commented Jan 21, 2020

@sphuber I think Giovanni is quite busy now - would you mind having a look?
I think something like this should probably be in 1.1.0

@sphuber
Copy link
Contributor

sphuber commented Jan 22, 2020

Thanks @ltalirz good start, but couple things remain. I think we should also add this for SqlAlchemy. There is a page on the documentation about bulk insertions, the various available solutions and their performance. More detailed information on these methods can be found here. Currently, the SqlAlchemy import implementation just uses the session.add method, i.e. no bulk/batching whatsoever. I can have a look at some point if I can adapt it but not sure when I will have the time.

Since this functionality will have to be implemented for both backends, I think it makes sense to have the batch size be configured the same way. This would also allow us to make it configurable through verdi config. This solves your first checkbox, we simply go with that value for now and if that stops working for someone, they can simply adapt it through configuration. It also solves the second checkbox: the value will simply be taken from the configuration, for both backends.

@ltalirz
Copy link
Member Author

ltalirz commented Jan 22, 2020

Putting it into the config makes sense to me. Shall we say AIIDADB_BATCH_SIZE?

As for adding it to sqla as well - in order to get at least the django fix into the code, I suggest I go ahead with django only for the moment and open an issue with pointers on how to proceed for sqla. Sounds good?

@ltalirz ltalirz force-pushed the issue_3712_max_postgres branch from 5f21fbc to f55598d Compare February 11, 2020 12:08
@ltalirz ltalirz requested review from sphuber and removed request for giovannipizzi February 11, 2020 12:09
@ltalirz
Copy link
Member Author

ltalirz commented Feb 11, 2020

I've updated the PR to make it a config option.
In the interest of time, I suggest we leave the sqla implementation to later.

Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two minor things

@@ -471,9 +472,9 @@ def import_data_dj(
if 'mtime' in [field.name for field in model._meta.local_fields]:
with models.suppress_auto_now([(model, ['mtime'])]):
# Store them all in once; however, the PK are not set in this way...
model.objects.bulk_create(objects_to_create)
model.objects.bulk_create(objects_to_create, batch_size=get_config_option('db.batch_size'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably better to just fetch the value once and assign to local variable that is to be reused

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm yes... I was thinking there might be cases where you are creating a new profile, where you don't really want to use the setting of the current profile, so I thought it might be safer to start with a global option.
Anyhow, I think making it not only global is fine

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you meant this to be a reply to the other comment ;) Anyway, that is not really what the global_only means. When an option is global_only it means it can only be set instance-wide and not per profile. Currently this only applies to the user_* options used for making repeated profile creation easier, so there it doesn't make sense as profile specific. The resolution order of configuration settings is profile -> global -> default. So if an existing profile defines a specific db.batch_size for itself, a new profile would still get the default, because there is nothing set globally.

'description':
'Batch size for bulk CREATE operations in the database. Avoids hitting MaxAllocSize of PostgreSQL'
'(1GB) when creating large numbers of database records in one go.',
'global_only': True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason to limit it to global option only? Maybe not the most common use case to have different batch sizes, but it is possible to have multiple profiles in one installation that connect to different databases on different machines with different memory limitations. Best to leave unconstrained I would say

postgresql has a "MaxAllocSize" that defaults to 1 GB [1].
If you try to insert more than that in one go (e.g. during import of a
large AiiDA export file), you encounter the error

    psycopg2.errors.ProgramLimitExceeded: out of memory
    DETAIL:  Cannot enlarge string buffer containing 0 bytes by 1257443654 more bytes.

This commit avoids this issue (for the django backend) by setting a
batch size for bulk_create operations (via a verdi config option).

[1] https://github.com/postgres/postgres/blob/master/src/include/utils/memutils.h#L40
max alloc" size
@ltalirz ltalirz force-pushed the issue_3712_max_postgres branch from f55598d to 5e470ac Compare February 11, 2020 18:38
@ltalirz
Copy link
Member Author

ltalirz commented Feb 11, 2020

Sorry for the long wait - was in meetings, meetings, meetings ;-)

@sphuber
Copy link
Contributor

sphuber commented Feb 12, 2020

Sorry for the long wait - was in meetings, meetings, meetings ;-)

Not to worry, thanks a lot for the improvement 👍

@sphuber sphuber merged commit b7bcf96 into aiidateam:develop Feb 12, 2020
@sphuber sphuber deleted the issue_3712_max_postgres branch February 12, 2020 09:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants