
Create new feature for running and logging dbt commands #841

Merged: 10 commits merged into move-coop:main on Jan 18, 2024

Conversation

@austinweisgrau (Collaborator)

dbt is an increasingly popular tool in the data analytics community, for very good reason!

dbt's primary interface is through shell commands, which makes interacting with dbt commands programmatically somewhat difficult.

This new feature enables executing dbt commands from Python, logs results to the Python logger, and can optionally send formatted results to Slack. The Slack message follows the formatting convention that dbt Cloud uses for its Slack messages.

@austinweisgrau (Collaborator, Author)

Recent updates to dbt enable executing dbt commands directly in Python without needing to open a shell process. Using that dbt feature would be an improvement on this approach. Docs on the new dbt feature are here.
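
For reference, the programmatic-invocation API that recent dbt Core versions (1.5+) expose looks roughly like the following, adapted from the dbt documentation. The selector tag is just an example, and this PR as written still shells out, so the snippet is illustrative rather than the PR's code:

from dbt.cli.main import dbtRunner, dbtRunnerResult

# Initialize the runner once; it can be reused across invocations.
dbt = dbtRunner()

# CLI args are passed as a list of strings, just like on the command line.
res: dbtRunnerResult = dbt.invoke(["run", "--select", "tag:nightly"])

# Inspect per-node results (name and status) from the run.
for r in res.result:
    print(f"{r.node.name}: {r.status}")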

@shaunagm (Collaborator)

This looks great, @austinweisgrau! I'm unfamiliar with dbt myself, though, so I'm hoping someone else will also be able to review this.

Some more general questions/thoughts:

  • Will this only work with Redshift? If so, is that because Redshift is all you have time/access to build around, or is there a more fundamental reason (like the underlying functionality is built for Redshift)?
  • I wonder where the best place to put this is. Maybe in databases? I don't think it belongs at the top level, where it currently is.
  • We'll need docs and tests for these, before they can be merged.

Thank you again!

@austinweisgrau (Collaborator, Author)

Good questions!

  • As currently built, it will only work with Redshift. It should be simple to modify it to work with other databases, but Redshift is what I'm using, so that's what I took the time to write up.
  • Good question! I don't have a great understanding of the organization of the package so I'm not sure.
  • Where's the best place to put the docs? Docstrings? Top of the module? In a readme?

@shaunagm (Collaborator) left a comment

I have some structural thoughts around this which I've tried to convey in comments, but I'm also happy to hop on a call if you're up for that - it might make explaining easier. I can also take a stab at doing the refactoring myself if you don't have the time to spend on making this a little more generalizable.

I think probably this should go in the utilities folder. Which also answers the question of where the docs should go. The utilities are currently terribly underdocumented, but you can add a general description to this file, plus the autodocs syntax so it will grab your docstrings. Let me know if you need more instruction.

logger = logging.getLogger(__name__)


class dbtLogger:

@shaunagm (Collaborator)

I'm more concerned with abstracting from "Redshift" to "works with all databases" but we could also abstract from "Slack" to "works with any notification system".

To be clear, I'm not asking you to build that capability, just thinking about how we could structure this so it's easy for someone else to come build it later. In particular, I'm thinking about dependency injection/inversion of control. I don't have much CS background - do you? My understanding is that in cases like these we'd want a logger that accepts a notification/messaging object and calls message() on it, and a runner that accepts a database connector and calls query() on it, without caring about the implementation.
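
To make that shape concrete, here is a minimal sketch of the dependency-injection pattern being described; the class and method names are illustrative, not taken from the PR:

from typing import Protocol


class Notifier(Protocol):
    def message(self, text: str) -> None: ...


class Database(Protocol):
    def query(self, sql: str) -> None: ...


class CommandLogger:
    """Collects log lines and hands them to whatever notifier it was given."""

    def __init__(self, notifier: Notifier) -> None:
        self.notifier = notifier
        self.log_messages: list[str] = []

    def send(self) -> None:
        # The logger only knows the message() contract, not Slack specifically.
        self.notifier.message("\n".join(self.log_messages))


class CommandRunner:
    """Runs commands against whatever connector it was given via query()."""

    def __init__(self, db: Database, logger: CommandLogger) -> None:
        self.db = db
        self.logger = logger

    def run(self, sql: str) -> None:
        self.db.query(sql)
        self.logger.log_messages.append(f"Ran: {sql}")

Any object with a message() method (Slack, email, a test stub) could then be passed in without changing the logger.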

@shaunagm (Collaborator)

Reading through this whole thing, I realize now that you're not really using any of the DB connectors directly, so you're not calling query(). Instead, you're using the dbt_redshift dependency and sending connection info and commands directly to it.

So the challenge becomes whether we can abstract over the dbt_redshift interface vs the dbt_bigquery interface.

@austinweisgrau (Collaborator, Author)

Ok, this is resolved with 088f974, which allows all of this to work with any database engine that works with dbt. This is actually handled at the level of package dependencies rather than in the code itself, so the dbt extra can now be installed with any database-specific version of dbt, e.g. pip install parsons[dbt-redshift], pip install parsons[dbt-bigquery], etc.
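
As a rough illustration of what handling this at the dependency level means, the extras might be declared along these lines in setup.py; the exact extra names and version pins used in Parsons may differ:

extras_require = {
    # Each extra pulls in the matching dbt adapter; dbt Core itself comes
    # along as a dependency of the adapter package.
    "dbt-redshift": ["dbt-redshift"],
    "dbt-bigquery": ["dbt-bigquery"],
    "dbt-postgres": ["dbt-postgres"],
    "dbt-snowflake": ["dbt-snowflake"],
}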

full_log_message += f"\n*Duration:* {duration_time_str}\n\n"
full_log_message += "\n".join(self.log_messages)

if self.slack_webhook:

@shaunagm (Collaborator)

I think this is the main (only?) place where the logger is calling Slack directly, so it should be straightforward to lift out. I think the main challenge is handling the conditional here, on whether to use the webhook or the API key.

Also, :red_circle: etc. is going to be platform-specific, but I don't think it's a big deal if a notification shows :red_circle: instead of an actual red circle.

@austinweisgrau (Collaborator, Author)

As written, if no Slack configs are provided (no API key or webhook is passed in), this if/else block will not execute any code and logging will remain limited to stdout.
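
For illustration, a self-contained sketch of that control flow, with hypothetical parameter names and logging placeholders standing in for the PR's actual attributes and Slack calls:

import logging
from typing import Optional

logger = logging.getLogger(__name__)


def send_log_message(
    message: str,
    slack_webhook: Optional[str] = None,
    slack_api_key: Optional[str] = None,
) -> None:
    # Hypothetical names; the logger calls below are placeholders for real Slack calls.
    if slack_webhook:
        logger.info("Would post to the Slack webhook: %s", message)
    elif slack_api_key:
        logger.info("Would send via the Slack API: %s", message)
    # With neither configured, nothing is sent to Slack and output stays
    # limited to the standard Python logger / stdout.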

Parsons has a few other notification interfaces, but for a couple of reasons I don't think it's important to generalize this particular utility's notifications over them at this time:

  • We discussed this PR in the TMC engineering working group, and no one had a use case besides Slack for notifications
  • Parsons itself doesn't have a generic notifications API, making it a bit cumbersome to try and build one into this PR

def run(self) -> None:
    for command in self.commands:
        self.dbt_command(command)
    self.dbt_logger.send_to_slack()

@shaunagm (Collaborator)

We'd also want this line to be generic, like send_notification instead of Slack specifically.

@austinweisgrau (Collaborator, Author)

Ok, I finally looped back and made some updates:

  • Added instructions for how to install & use the utility (a rough usage sketch follows below)
  • Moved the code from being its own connector into the utilities folder
  • Enabled installation with all the database engines that dbt itself supports
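
As flagged in the first bullet, usage might look roughly like this; the import path, class name, and parameters are guesses based on the snippets reviewed above, so check the merged docs for the actual interface:

# Hypothetical usage sketch; the real import path, names, and parameters in
# the merged utility may differ.
from parsons.utilities import dbt as parsons_dbt  # assumed module location

runner = parsons_dbt.dbtRunner(  # assumed class name
    commands=["run", "test"],  # dbt commands to execute in order
    slack_webhook="https://hooks.slack.com/services/...",  # optional summary target
)
runner.run()  # runs each command, logs results, and posts a summary to Slack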

@shaunagm (Collaborator)

shaunagm commented Oct 2, 2023

I'm planning on re-reviewing this in a few weeks, after we've done the dbt Fundamentals cohort and I have a better mental sense of what dbt is and how it fits with Parsons. Just wanted to update this PR so no one reads it and wonders why it's been sitting unmerged for over a month.

@shaunagm (Collaborator)

Hi @austinweisgrau - sorry I'm just now getting back to this. This still looks good to me, and after taking the dbt course I better understand the value of this utility. I'm going to update the branch and merge it.

@shaunagm merged commit 7f24b4c into move-coop:main on Jan 18, 2024. 9 checks passed.
@austinweisgrau deleted the dbt branch on July 15, 2024.