-
Notifications
You must be signed in to change notification settings - Fork 14.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhancement: Performance enhancement for insert_rows method DbApiHook with fast executemany + SAP Hana support #37246
Conversation
…st executemany method on the cursor can be used to achieve better performance when inserting in bulk. Also check if dialect is SAP Hana in _generate_insert_sql method so the UPSERT statement can be used as REPLACE INTO doesn't exist on SAP Hana.
I think it should be two-fold:
We already have a mechanims that when you use the common SQL Operators, you can pass hook kwargs as |
Thank you Jarek, I understand now what you are proposing, will try to do it like this. |
…are parametrized # Conflicts: # tests/providers/common/sql/hooks/test_dbapi.py
I've adapted the so that the format of the insert and replace statements are now parametrized and thus independant of any dialect. I just want to implement a test with an airflow connection to make sure the parameter in the DbApiHook is well interpreted from the extras field, once test is written the code can be reviewed then. |
Yep. That's great direction. |
Also what you could do with this change is likely to implement this: #34860 which will follow the same patterm but by Snowflake Hook |
We could think of some kind of strategy that is applied for certain hooks/or sqlalchemy dialects. We could have a default strategy, and if a specific strategy is found (base on dialect name for example), then use that one. This is something I was also thinking about now that you mention the above issue. Maybe something for another pull request, as I would like to finish this one as fast as possible as we are also impacted by it. I would also like to take the oppurtunity to add an common SQL operator which allows you to easily persist (insert/replace/upsert) a list of dicts into the database so that you don't need a python operator/decorated task with a hook, something that is now missing in Airflow I. The operator would also simplify the DAG's even more and is also benificient for the DAG processing. We tend to have as less as possible (custom) python code in our DAGs to improve governance and have fast solutions and easy to maintain DAG's which are less error prone. |
Yep. why not. Nothing against doing it in separate PR. |
OK. To merge it now and we can do further PRs/iterations to add more functionality here. |
@@ -37,6 +37,7 @@ | |||
|
|||
import sqlparse | |||
from deprecated import deprecated | |||
from more_itertools import chunked |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to add more_itertools
as a dependency for the provider?
Upon testing RC Provider common.sql: 1.11.0rc2, I am getting the below error
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/airflow/providers/common/sql/operators/sql.py", line 28, in <module>
from airflow.providers.common.sql.hooks.sql import DbApiHook, fetch_all_handler, return_single_query_results
File "/usr/local/lib/python3.11/site-packages/airflow/providers/common/sql/hooks/sql.py", line 40, in <module>
from more_itertools import chunked
ModuleNotFoundError: No module named 'more_itertools'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm looks like dependency is missing for more-itertools
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. Good point. We need to do it, we missed that one. Great catch @pankajkoti . We do have more-itertools in our CI image because it's a transitive dependency of a few other packages (hatch/ twine/salesforce) but we definitely need to add it for common.sql
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the quick fix. The (Astronomer) Providers RC automation test suite reported this failure :)
… with fast executemany + SAP Hana support (apache#37246) * refactor: Added executemany parameter to insert_rows method so the fast executemany method on the cursor can be used to achieve better performance when inserting in bulk. Also check if dialect is SAP Hana in _generate_insert_sql method so the UPSERT statement can be used as REPLACE INTO doesn't exist on SAP Hana. * refactor: Reformatted code to be static check compliant * refactor: Refactored DbApiHook so that insert and replace statements are parametrized # Conflicts: # tests/providers/common/sql/hooks/test_dbapi.py * refactor: Fixed some formatting in DbApiHook --------- Co-authored-by: David Blain <david.blain@infrabel.be>
… with fast executemany + SAP Hana support (apache#37246) * refactor: Added executemany parameter to insert_rows method so the fast executemany method on the cursor can be used to achieve better performance when inserting in bulk. Also check if dialect is SAP Hana in _generate_insert_sql method so the UPSERT statement can be used as REPLACE INTO doesn't exist on SAP Hana. * refactor: Reformatted code to be static check compliant * refactor: Refactored DbApiHook so that insert and replace statements are parametrized # Conflicts: # tests/providers/common/sql/hooks/test_dbapi.py * refactor: Fixed some formatting in DbApiHook --------- Co-authored-by: David Blain <david.blain@infrabel.be>
…st executemany method on the cursor can be used to achieve better performance when inserting in bulk. Also check if dialect is SAP Hana in _generate_insert_sql method so the UPSERT statement can be used as REPLACE INTO doesn't exist on SAP Hana.
Added executemany parameter to insert_rows method of DbApiHook so the fast executemany method on the cursor can be used to achieve better performance when inserting in bulk instead of iterating over each row and inserting it one by one which is not efficient. Also check if dialect is SAP Hana in _generate_insert_sql method so the UPSERT statement can be used as REPLACE INTO doesn't exist on SAP Hana. Those changes have been unit tested in TestDbApiHook.
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rst
or{issue_number}.significant.rst
, in newsfragments.