Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graph Analysis Optimization for Large Dags #6720

Merged
merged 2 commits into from
Jan 26, 2023

Conversation

peterallenwebb
Copy link
Contributor

@peterallenwebb peterallenwebb commented Jan 25, 2023

resolves #6759

Description

This optimization reduces by 25% the time needed to run dbt compile in a large scenario provided to me by @boxysean. In that scenario, dbt originally spent 25% of its time in the GraphQueue._find_new_additions() function, which has been modified in the PR so that it takes almost no time at all.

The optimization works because the the in-degree of a node in the GraphQueue's graph only changes when we remove a node, and the only nodes whose in-degree will change are those that were successors of the deleted node. Instead of looking at every node in the graph to see whether its in-degree has changed to zero, we now look only at successors of the deleted node.

Checklist

@peterallenwebb peterallenwebb requested review from a team and iknox-fa January 25, 2023 04:41
@cla-bot cla-bot bot added the cla:yes label Jan 25, 2023
@github-actions
Copy link
Contributor

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

Copy link
Contributor

@dbeatty10 dbeatty10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it takes almost no time at all

🎉

This looks awesome @peterallenwebb 🧠

If I'm understanding correctly, the original implementation would have looked through all four of these nodes, no matter what.

image

In your new implementation, if Node 2 is removed, then only its successors (Nodes 3 and 4) will be visited, and only Node 3 will change to an in-degree of zero.

We can see that before it would visit all X nodes in the average case. But now the average case is just the typical number of successors (i.e., the mean number of out-degrees per node).

I would guess for most dbt projects, the mean number of out-degrees will be WAAAAY closer to 0 than it will be to X, hence "taking almost no time at all" 😉

@peterallenwebb
Copy link
Contributor Author

@dbeatty10 Yes, your summarization is completely correct. @iknox-fa, since you've likely spent a lot more time with this code than me, can you let me know what you think about this change? If you think it looks good, I'll plan to merge soon.

@dbeatty10 dbeatty10 self-requested a review January 25, 2023 21:11
Copy link
Contributor

@dbeatty10 dbeatty10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peterallenwebb I didn't see a changelog entry or an associated GitHub issue.

Is there a reason we wouldn't want to do both of those here?

@peterallenwebb
Copy link
Contributor Author

@peterallenwebb I didn't see a changelog entry or an associated GitHub issue.

Is there a reason we wouldn't want to do both of those here?

No reason. Since I'm new to this part of the code, I just wanted a sanity check before I went to the trouble, but I'll create those tomorrow.

@peterallenwebb peterallenwebb requested a review from a team as a code owner January 26, 2023 19:02
@peterallenwebb peterallenwebb dismissed dbeatty10’s stale review January 26, 2023 19:08

Added changelog entry, and there is now an issue.

Copy link
Contributor

@dbeatty10 dbeatty10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@peterallenwebb peterallenwebb merged commit c2c4757 into main Jan 26, 2023
@peterallenwebb peterallenwebb deleted the paw/find-new-additions-perf branch January 26, 2023 19:27
@jtcohen6 jtcohen6 mentioned this pull request Feb 7, 2023
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CT-1927] [Feature] Consider only successors of the deleted nodes when populating the GraphQueue
2 participants