
Sensor for Databricks partition and table changes #28950

Closed
wants to merge 59 commits

Conversation

harishkrao
Contributor

@harishkrao harishkrao commented Jan 15, 2023

Closes: #21381


Sensors for Databricks SQL to detect table partitions and new table events.

@harishkrao
Contributor Author

@alexott it would be great if you could review the PR and provide feedback. Thank you for your time.

Contributor

@alexott alexott left a comment

Thank you for your contribution!

I would make the DatabricksSqlSensor fully generic by allowing an arbitrary SQL expression to be passed in; the sensor triggers when that query returns a non-empty result. The partition and history-change sensors can then be built on top of it (see the sketch after the list below).

Also we need:

  • Documentation
  • The sensor should be declared in airflow/providers/databricks/provider.yaml
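A minimal sketch of that generic-sensor idea, assuming the provider's existing DatabricksSqlHook and the common fetch_all_handler (the exact signature in this PR may differ):

from typing import Any, Callable, Sequence

from airflow.providers.common.sql.hooks.sql import fetch_all_handler
from airflow.providers.databricks.hooks.databricks_sql import DatabricksSqlHook
from airflow.sensors.base import BaseSensorOperator


class DatabricksSqlSensor(BaseSensorOperator):
    """Succeeds as soon as the supplied SQL statement returns a non-empty result."""

    template_fields: Sequence[str] = ("sql",)

    def __init__(
        self,
        *,
        sql: str,
        databricks_conn_id: str = DatabricksSqlHook.default_conn_name,
        handler: Callable[[Any], Any] = fetch_all_handler,
        **kwargs,
    ) -> None:
        super().__init__(**kwargs)
        self.sql = sql
        self.databricks_conn_id = databricks_conn_id
        self.handler = handler

    def poke(self, context) -> bool:
        # Run the user-supplied query; any non-empty result satisfies the sensor.
        hook = DatabricksSqlHook(databricks_conn_id=self.databricks_conn_id)
        result = hook.run(self.sql, handler=self.handler)
        return bool(result)

The partition and history-change sensors would then only need to build the appropriate SQL and reuse this poke logic.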

"""
Generic Databricks SQL sensor.

:param databricks_conn_id:str=DatabricksSqlHook.default_conn_name: Specify the name of the connection
Contributor

You don't need to include the type in the docstrings; just use :param {argname}: {description}. All information about the expected types is taken from the annotations.

Including the type here may cause issues when the documentation is generated.
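For example, a parameter documented in the preferred form would look roughly like this (illustrative fragment only):

def __init__(self, databricks_conn_id: str = "databricks_default") -> None:
    """
    :param databricks_conn_id: Name of the Databricks connection to use.
        The expected type is taken from the annotation, not from the docstring.
    """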

Contributor Author

Removed the type. Thanks for the feedback.

Comment on lines 125 to 131

Args:
context (Context): Airflow context
lookup_key (_type_): Unique lookup key used to store values related to a specific table.

Returns:
int: Version number
Contributor

We don't use Google style in docstrings; please use reStructuredText.
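The same information in reStructuredText field-list form would look roughly like this (the method name is hypothetical; the fields mirror the Args:/Returns: block quoted above):

def get_previous_version(self, context, lookup_key):
    """
    Return the previously stored version for a table.

    :param context: Airflow context
    :param lookup_key: Unique lookup key used to store values related to a specific table.
    :return: Version number
    """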

Contributor Author

Changed it.

partition_name: dict = {"date": "2023-1-1"},
handler: Callable[[Any], Any] = fetch_all_handler,
db_sensor_type: str,
timestamp: datetime = datetime.now() - timedelta(days=7),
Contributor

I think we do not need any default value for timestamp.
If it is a mandatory field, just make it mandatory and have the user provide the actual value here.

Contributor Author

The reason I thought it needs some default value is to extract history for a time period even when the user does not provide a custom value, for example the past 7 days.

@alexott
Contributor

alexott commented Jan 15, 2023

Also, it makes sense to declare some of the properties as templated, for example the partitions mapping, etc.

@alexott
Contributor

alexott commented Jan 15, 2023

Also, for the partition sensor it would make sense to allow specifying comparison operations on the partitions: not only = as a comparison, but also IN (if the value is a list), >, <, !=, and so on (a rough sketch follows).
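A rough sketch of what that operator handling could look like (the helper name and the escaping are illustrative only):

def build_partition_predicate(column: str, value, operator: str = "=") -> str:
    # Lists map naturally to IN; everything else uses the supplied comparison operator.
    if isinstance(value, (list, tuple)):
        rendered = ", ".join(repr(v) for v in value)
        return f"{column} IN ({rendered})"
    return f"{column} {operator} {value!r}"

# build_partition_predicate("date", ["2023-01-01", "2023-01-02"]) -> "date IN ('2023-01-01', '2023-01-02')"
# build_partition_predicate("year", 2022, ">=")                   -> "year >= 2022"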

@harishkrao
Contributor Author

Also, it makes sense to declare some of the properties as templated, for example the partitions mapping, etc.

@alexott agree, that would be good to implement. Do you have an example to follow for this one?

@alexott
Contributor

alexott commented Jan 16, 2023

Yes, just look at DatabricksSqlOperator; it allows templating the sql field. Just take into account that template expansion happens after __init__ is called (a short illustration follows).
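To illustrate (a sketch only; the field names follow this PR but the final set may differ), templated fields are declared as a class attribute, and the rendered values are only available after __init__:

from typing import Sequence

from airflow.sensors.base import BaseSensorOperator


class DatabricksPartitionSensor(BaseSensorOperator):
    template_fields: Sequence[str] = ("table_name", "partition_name")

    def __init__(self, *, table_name: str, partition_name: dict, **kwargs) -> None:
        super().__init__(**kwargs)
        self.table_name = table_name          # may still contain "{{ ... }}" at this point
        self.partition_name = partition_name  # Jinja rendering happens just before execution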

@harishkrao
Contributor Author

@alexott @Taragolis Thank you for taking the time to review the PR and for giving valuable feedback. I have addressed all of the comments. It would be great if you could review the changes. Thank you again!

@harishkrao harishkrao force-pushed the databricks-sql-sensor branch 4 times, most recently from 6a9e11b to 7781bd7 on January 29, 2023 03:03
Comment on lines 96 to 100
- integration-name: Databricks SQL
python-modules:
- airflow.providers.databricks.sensors.databricks_sql
- integration-name: Databricks Partition
python-modules:
- airflow.providers.databricks.sensors.databricks_partition
- integration-name: Databricks Table Changes
python-modules:
- airflow.providers.databricks.sensors.databricks_table_changes
Contributor

We can combine all of them under the Databricks SQL umbrella

table_name: str = "",
partition_name: dict,
handler: Callable[[Any], Any] = fetch_one_handler,
caller: str = "DatabricksPartitionSensor",
Contributor

Let's hardcode caller instead of passing it as an argument.

Comment on lines 70 to 77
databricks_conn_id: str = DatabricksSqlHook.default_conn_name,
http_path: str | None = None,
sql_endpoint_name: str | None = None,
session_configuration=None,
http_headers: list[tuple[str, str]] | None = None,
catalog: str = "",
schema: str = "default",
table_name: str = "",
Contributor

Can we inherit most of the parameters from DatabricksSqlSensor?

Contributor Author

Yes, changed it.

self.handler = handler
self.partition_operator = partition_operator

def _get_hook(self) -> DatabricksSqlHook:
Contributor

If we pass all parameters to the DatabricksSqlSensor, then we can simply inherit that hook from it.

Contributor Author

Good point, implemented the change.

return sql_result

def _check_table_partitions(self) -> list:
if self.catalog is not None:
Contributor

It will always be executed this way because we're passing an empty string as the default. Also, we can continue to rely on two-level and even one-level naming, falling back to the default catalog & schema.

Contributor Author

Changed it.

Comment on lines 118 to 121
if len(result) >= 1:
return True
else:
return False
Contributor

this is really just return len(result) > 0

Contributor Author

Changed it.

Comment on lines 66 to 77
databricks_conn_id: str = DatabricksSqlHook.default_conn_name,
http_path: str | None = None,
sql_endpoint_name: str | None = None,
session_configuration=None,
http_headers: list[tuple[str, str]] | None = None,
catalog: str = "",
schema: str = "default",
table_name: str = "",
handler: Callable[[Any], Any] = fetch_all_handler,
timestamp: datetime = datetime.now() - timedelta(days=7),
caller: str = "DatabricksTableChangesSensor",
client_parameters: dict[str, Any] | None = None,
Contributor

Same comments as for partition sensor

Contributor Author

Changed it.


def get_current_table_version(self, table_name, time_range):
change_sql = (
f"SELECT COUNT(version) as versions from "
Contributor

what about selecting max(version) instead of count?

Contributor Author

Agree, will change it.

change_sql = (
f"SELECT COUNT(version) as versions from "
f"(DESCRIBE HISTORY {table_name}) "
f"WHERE timestamp >= '{time_range}'"
Contributor

if time_range doesn't change between calls, we may return true all the time. Why not compare with the latest version instead?

Contributor

Does it make sense to report only data changes? Otherwise we'll get unnecessary changes, e.g. after VACUUM, OPTIMIZE, ...

Contributor Author

I am working on this, will push the changes soon.

Contributor Author

Added these changes in the recent push.

Comment on lines 135 to 139
if self.catalog is not None:
complete_table_name = str(self.catalog + "." + self.schema + "." + self.table_name)
self.log.debug("Table name generated from arguments: %s", complete_table_name)
else:
raise AirflowException("Catalog name not specified, aborting query execution.")
Contributor

Same comment as for partition sensor.

Contributor Author

Removed it.

@@ -17,6 +17,9 @@
# under the License.
Contributor Author

@alexott FYI: The Escaper class was not available on the main branch, so I added it as part of this PR to be used in our provider.
Also, I made some minor changes to exception handling in the Escaper class because it was looking for some user-defined exception classes from pyhive.

Contributor Author

How do I use the Escaper class in databricks-sql-python within Airflow? Do I add it to setup.py to be installed when Airflow starts?

Contributor Author

Thanks for the help, I imported it via from databricks.sql.utils import ParamEscaper

Contributor

@alexott alexott left a comment

There are some minor changes required, but otherwise - looks good

Comment on lines 47 to 48
:param _catalog: An optional initial catalog to use. Requires DBR version 9.0+ (templated)
:param _schema: An optional initial schema to use. Requires DBR version 9.0+ (templated)
Contributor

Why do these start with _? They then don't match the other operators.

Contributor Author

Changed it and made it consistent with other operators.

output_list.append(
f"""{partition_col}{self.partition_operator}{self.escaper.escape_item(partition_value)}"""
)
if isinstance(partition_value, (str, datetime.date)):
Contributor

What about timestamps, i.e. datetime.datetime?

Contributor Author

Changed it.

)
return self._sql_sensor(partition_sql)

def _get_results(self, context: Context) -> bool:
Contributor

Why do we need context here?

Contributor Author

We do not need it, removed it.

Databricks connection's extra parameters.
:param http_headers: An optional list of (k, v) pairs that will be set as HTTP headers on every request.
:param client_parameters: Additional parameters internal to Databricks SQL Connector parameters.
:param sql: SQL query to be executed.
Contributor

Not all parameters for __init__ are documented.

Contributor Author

Updated.

Comment on lines 162 to 163
if len(result) < 1:
raise AirflowException("Databricks SQL partition sensor failed.")
Contributor

This handles the lack of results differently from the generic sensor.

Contributor Author

Unified the handling across classes.

Comment on lines 48 to 50
:param _catalog: An optional initial catalog to use.
Requires DBR version 9.0+ (templated), defaults to ""
:param _schema: An optional initial schema to use.
Requires DBR version 9.0+ (templated), defaults to "default"
Contributor

Same here: unify the names with the base operator.

Contributor Author

Fixed.

def get_current_table_version(self, table_name, time_range, operator):
_count_describe_literal = "SELECT MAX(version) AS versions FROM (DESCRIBE HISTORY"
_filter_predicate_literal = ") WHERE timestamp"
_operation_filter_literal = "AND operation NOT LIKE '%CONVERT%' AND operation NOT LIKE '%OPTIMIZE%' \
Contributor

Wouldn't it be simpler to do operation NOT IN ('CONVERT', 'OPTIMIZE', ...)?

Contributor Author

@harishkrao harishkrao Feb 13, 2023

Yes, added FSCK and changed it to a NOT IN filter. Additionally, it is interesting to note that, after running a few commands such as FSCK and OPTIMIZE, they are not recorded in the history of the Delta table.

Contributor

Regarding VACUUM START, ..., let's try to use just a single trailing % (i.e. VACUUM%) to avoid searching by substring. A combined sketch of the query follows.
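Putting both suggestions together, the history query could end up looking roughly like this (the table name, time range, and exact operation list are illustrative only):

# Hypothetical values; in the sensor these come from the task arguments.
complete_table_name = "my_catalog.my_schema.my_table"
time_range = "2023-01-01T00:00:00"
change_sql = (
    f"SELECT MAX(version) AS versions FROM (DESCRIBE HISTORY {complete_table_name}) "
    f"WHERE timestamp >= '{time_range}' "
    "AND operation NOT IN ('CONVERT', 'FSCK') "
    "AND operation NOT LIKE 'VACUUM%' "
    "AND operation NOT LIKE 'OPTIMIZE%'"
)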

Contributor Author

Good point, changed it.

def __init__(
self,
table_name: str,
timestamp: datetime = datetime.now() - timedelta(days=7),
Contributor

What about making this optional? For example, I might want to check for changes without taking the timestamp into account.

Contributor Author

Changed.

@harishkrao
Contributor Author

There are some minor changes required, but otherwise - looks good

@alexott thank you for taking the time to review! I resolved all the comments (except the escaper class).

@harishkrao harishkrao force-pushed the databricks-sql-sensor branch 5 times, most recently from d18d4a7 to bf87dba on February 16, 2023 19:58
@eladkal
Contributor

eladkal commented Feb 17, 2023

@harishkrao does this PR solve #21381?

@alexott
Contributor

alexott commented Feb 19, 2023

@eladkal yes, it will solve #21381

Contributor

@alexott alexott left a comment

See my comment about returning False vs. throwing an exception when there are no results.

But the primary request for changes is to add the missing pieces:

  • We need documentation to be added as well
  • Documentation should include examples: add a sensor example to tests/system/providers/databricks; it will be used for integration tests

@harishkrao
Contributor Author

See my comment about returning False vs. throwing an exception when there are no results.

But the primary request for changes is to add the missing pieces:

* We need documentation to be added as well

* Documentation should include examples: add a sensor example to `tests/system/providers/databricks`; it will be used for integration tests

@alexott just pushed an example DAG file, similar to the ones for Operators.

Contributor

@eladkal eladkal left a comment

Apparently I missed things because I reviewed from my phone.
I think the code requires some more work. I'm not familiar with Databricks, but the sensors seem very complex, and I wonder for all of them whether the logic shouldn't live in the hook.

@o-nikolas @josh-fell I'd appreciate another pair of eyes here.

Comment on lines +96 to +97
def get_previous_version(context: Context, lookup_key):
return context["ti"].xcom_pull(key=lookup_key, include_prior_dates=True)
Contributor

@eladkal eladkal Feb 27, 2023

I don't understand the XCom part.
Why does the sensor push to and pull from XCom on every poke?

Contributor Author

@harishkrao harishkrao Feb 27, 2023

We store metadata about the most recently queried version for that table, and we send/receive it using XCom. After querying the current version from Databricks, we compare it with the one stored in the metadata and take an action accordingly.

Contributor

I don't believe there is a guarantee that the most recent XCom will be pulled here. Behind the scenes XCom.get_many() is called and just retrieves the first record. At the mercy of the metadatabase being used there.

Also, what happens in a mapped operator situation? If the XCom key is always the same, it seems possible this can pull an XCom key for an entirely different task since task_ids and map_index is not specified in the xcom_pull() call.

Another question then would be what if the input args are the same (i.e. checking for changes in the same table) but a user simply updates the task_id. Would this sensor yield a false positive that there was indeed a change?

I don't necessarily have answers to these questions on the top of my head, but some things to think about with using XComs in this way.
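One way to make the lookup more deterministic (a sketch only, not necessarily the final design) is to scope the pull to this task's own pushes instead of pulling by key alone:

def get_previous_version(context, lookup_key):
    ti = context["ti"]
    # Restrict the pull to this task's own XCom entries so another task
    # pushing the same key cannot be picked up by mistake.
    return ti.xcom_pull(task_ids=ti.task_id, key=lookup_key, include_prior_dates=True)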

Comment on lines +124 to +152
def _get_results_table_changes(self, context) -> bool:
complete_table_name = str(self.catalog + "." + self.schema + "." + self.table_name)
self.log.debug("Table name generated from arguments: %s", complete_table_name)

prev_version = -1
if context is not None:
lookup_key = complete_table_name
prev_data = self.get_previous_version(lookup_key=lookup_key, context=context)
self.log.debug("prev_data: %s, type=%s", str(prev_data), type(prev_data))
if isinstance(prev_data, int):
prev_version = prev_data
elif prev_data is not None:
raise AirflowException("Incorrect type for previous XCom data: %s", type(prev_data))
version = self.get_current_table_version(table_name=complete_table_name)
self.log.debug("Current version: %s", version)
if version is None:
return False
if prev_version < version:
result = True
else:
return False
if prev_version != version:
self.set_version(lookup_key=lookup_key, version=version, context=context)
self.log.debug("Result: %s", result)
return result
return False

def poke(self, context: Context) -> bool:
return self._get_results_table_changes(context=context)
Contributor

This looks very complicated.
A sensor should ask a simple question. Most of the logic for these operations should be functions in the hook (so they can also be utilized by other sensors or custom sensors that users create).

Contributor Author

@alexott can you please weigh in on the design decisions we made to arrive at this pattern?

Contributor

+1. Some of these functions look like they could be handy if made generally available as part of a hook.

output_list.append(
f"""{partition_col}{self.partition_operator}{self.escaper.escape_item(partition_value)}"""
)
# TODO: Check date types.
Contributor

leftover?

Contributor Author

I can remove this.

from airflow.utils.context import Context


class DatabricksSqlSensor(BaseSensorOperator):
Contributor

I'm missing something here.
If this sensor leverages DbApiHook, why doesn't it subclass SqlSensor?

Contributor Author

@harishkrao harishkrao Feb 27, 2023

I can change it to inherit the SqlSensor.

@harishkrao
Contributor Author

Apparently I missed things because I reviewed from my phone. I think the code requires some more work. I'm not familiar with Databricks, but the sensors seem very complex, and I wonder for all of them whether the logic shouldn't live in the hook.

@o-nikolas @josh-fell I'd appreciate another pair of eyes here.

@alexott can you please elaborate on the reasoning for why we wrote the sensors with this design?

) as dag:
# [docs]
connection_id = "databricks_default"
sql_endpoint_name = "Starter Warehouse"
Contributor

We usually add test code to set up the resource under test, or at least make it configurable (OS env var, etc.) so that users can set up their own and supply the correct config to test against it.

from airflow.providers.databricks.sensors.sql import DatabricksSqlSensor
from airflow.providers.databricks.sensors.table_changes import DatabricksTableChangesSensor

# [docs]
Contributor

What is the purpose of these tags?

Comment on lines +132 to +136
self.log.debug("prev_data: %s, type=%s", str(prev_data), type(prev_data))
if isinstance(prev_data, int):
prev_version = prev_data
elif prev_data is not None:
raise AirflowException("Incorrect type for previous XCom data: %s", type(prev_data))
Contributor

IMHO all this logic should be inside get_previous_version() rather than here.

Comment on lines +137 to +146
version = self.get_current_table_version(table_name=complete_table_name)
self.log.debug("Current version: %s", version)
if version is None:
return False
if prev_version < version:
result = True
else:
return False
if prev_version != version:
self.set_version(lookup_key=lookup_key, version=version, context=context)
Contributor

I don't think I fully understand this logic. The two False cases can certainly be collapsed to be more compact, but also, shouldn't the False case set result rather than return? If they return, the code to store the version in XCom is not executed. You're basically always comparing version with the prev_version default value of -1, from what I can tell.
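A sketch of the collapsed logic being described, reusing this PR's helper methods (get_previous_version, get_current_table_version, set_version); here the observed version is stored even when the poke ultimately returns False:

def _get_results_table_changes(self, context) -> bool:
    lookup_key = f"{self.catalog}.{self.schema}.{self.table_name}"
    prev_data = self.get_previous_version(context=context, lookup_key=lookup_key)
    prev_version = prev_data if isinstance(prev_data, int) else -1
    version = self.get_current_table_version(table_name=lookup_key)
    if version is None:
        return False
    if version != prev_version:
        # Record the new version regardless of the comparison outcome.
        self.set_version(context=context, lookup_key=lookup_key, version=version)
    return version > prev_version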

self.log.debug("Table name generated from arguments: %s", complete_table_name)

prev_version = -1
if context is not None:
Contributor

Are we really worried about context being missing?
If so, then just add a statement like:

if not context:
    return False

This way the whole main block of code doesn't have to be indented.
Also if context is really missing you may want to throw an exception instead of just returning False.

def set_version(context: Context, lookup_key, version):
context["ti"].xcom_push(key=lookup_key, value=version)

def get_current_table_version(self, table_name):
Contributor

Can any or all of this be pushed into the hook? Validating things like operators doesn't seem like the right thing for the sensor to do.

if len(partition_columns) < 1:
raise AirflowException("Table %s does not have partitions", table_name)
formatted_opts = ""
if opts is not None and len(opts) > 0:
Contributor

Suggested change
if opts is not None and len(opts) > 0:
if opts:

from airflow.providers.common.sql.hooks.sql import fetch_all_handler
from airflow.providers.databricks.hooks.databricks_sql import DatabricksSqlHook
from airflow.sensors.base import BaseSensorOperator
from airflow.utils.context import Context
Contributor

Since this import is only used for typing, it should be put behind typing.TYPE_CHECKING. One fewer import at runtime. Applicable to all of the other net-new modules in this PR too.
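For example, the standard pattern would look like this in each of the new modules:

from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Needed only for type hints, so it is not imported at runtime.
    from airflow.utils.context import Context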

Comment on lines +139 to +141
if len(result) < 1:
return False
return True
Contributor

Suggested change
if len(result) < 1:
return False
return True
return bool(result)

Small optimization.

Comment on lines +78 to +99
"""Sensor to execute SQL statements on a Delta table via Databricks.

:param databricks_conn_id: Reference to :ref:`Databricks
connection id<howto/connection:databricks>` (templated), defaults to
DatabricksSqlHook.default_conn_name
:param http_path: Optional string specifying HTTP path of Databricks SQL Endpoint or cluster.
If not specified, it should be either specified in the Databricks connection's
extra parameters, or ``sql_endpoint_name`` must be specified.
:param sql_endpoint_name: Optional name of Databricks SQL Endpoint. If not specified, ``http_path``
must be provided as described above, defaults to None
:param session_configuration: An optional dictionary of Spark session parameters. If not specified,
it could be specified in the Databricks connection's extra parameters., defaults to None
:param http_headers: An optional list of (k, v) pairs
that will be set as HTTP headers on every request. (templated).
:param catalog: An optional initial catalog to use.
Requires DBR version 9.0+ (templated), defaults to ""
:param schema: An optional initial schema to use.
Requires DBR version 9.0+ (templated), defaults to "default"
:param sql: SQL statement to be executed.
:param handler: Handler for DbApiHook.run() to return results, defaults to fetch_all_handler
:param client_parameters: Additional parameters internal to Databricks SQL Connector parameters.
"""
Contributor

There are two sets of docstrings for this sensor's construction. Can you consolidate please?

defaults to >=.
"""

template_fields: Sequence[str] = ("databricks_conn_id", "catalog", "schema", "table_name")
Contributor

IMO it would be useful to have timestamp as a template field too. I could foresee users wanting to use one of the built-in Jinja variables for this so the task is idempotent (like {{ data_interval_start }}, for example), or to have it be a dynamic input from a previous task.
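Illustrative only (the task id and connection settings are hypothetical, the import path follows this PR's current revision, and this assumes timestamp is added to template_fields): a DAG author could then tie the check to the run's own data interval:

from airflow.providers.databricks.sensors.table_changes import DatabricksTableChangesSensor

wait_for_changes = DatabricksTableChangesSensor(
    task_id="wait_for_table_changes",
    databricks_conn_id="databricks_default",
    sql_endpoint_name="Starter Warehouse",
    table_name="my_catalog.my_schema.my_table",
    timestamp="{{ data_interval_start }}",
)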

Comment on lines +80 to +87
def _sql_sensor(self, sql):
hook = self._get_hook()
sql_result = hook.run(
sql,
handler=self.handler if self.do_xcom_push else None,
)
return sql_result

Contributor

Suggested change
def _sql_sensor(self, sql):
hook = self._get_hook()
sql_result = hook.run(
sql,
handler=self.handler if self.do_xcom_push else None,
)
return sql_result

Same idea here. This method exists in DatabricksSqlSensor.

Comment on lines +152 to +153
def poke(self, context: Context) -> bool:
return self._get_results()
Contributor

Technically could remove this too.

)


class TestDatabricksPartitionSensor(unittest.TestCase):
Contributor

There is an ongoing effort to move away from unittest in favor of pytest. Since these tests are net-new, could you change this and the other tests in the PR to pytest please?
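A minimal sketch of the pytest style (the module path and constructor arguments are taken from this PR's current revision and may change):

import pytest

from airflow.providers.databricks.sensors.databricks_partition import DatabricksPartitionSensor


@pytest.fixture
def partition_sensor():
    return DatabricksPartitionSensor(
        task_id="test_partition_sensor",
        table_name="my_table",
        partition_name={"date": "2023-01-01"},
    )


def test_partition_sensor_init(partition_sensor):
    assert partition_sensor.table_name == "my_table"
    assert partition_sensor.partition_name == {"date": "2023-01-01"}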

connection_id = "databricks_default"
sql_endpoint_name = "Starter Warehouse"

# [START howto_sensor_databricks_sql]
Contributor

These START/END markers are used to include code snippets in guides (generally). It would be great if there were accompanying documentation for these new sensors that takes advantage of the snippets outlined in this DAG. There are many examples throughout the providers' operator guides of how this is done.

partition_columns = self._sql_sensor(f"DESCRIBE DETAIL {table_name}")[0][7]
self.log.info("table_info: %s", partition_columns)
if len(partition_columns) < 1:
raise AirflowException("Table %s does not have partitions", table_name)
Contributor

Suggested change
raise AirflowException("Table %s does not have partitions", table_name)
raise AirflowException(f"Table {table_name} does not have partitions")

Otherwise the message logged will not be what you expect.

# TODO: Check date types.
else:
raise AirflowException(
"Column %s not part of table partitions: %s", partition_col, partition_columns
Contributor

Suggested change
"Column %s not part of table partitions: %s", partition_col, partition_columns
f"Column {partition_col} not part of table partitions: {partition_columns}"

Same here.

@harishkrao
Contributor Author

@eladkal @o-nikolas @josh-fell @alexott thanks for taking the time to review my code and provide feedback; I appreciate it. To incorporate the changes for the 3 sensors, I will break them down into 3 separate PRs so that they are easier to manage and test individually, rather than one bulk of changes in a single PR.
I will work on the changes and open the new PRs.
I will close the currently open PR.
