-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Graph Analysis Optimization for Large Dags #6720
Conversation
Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it takes almost no time at all
🎉
This looks awesome @peterallenwebb 🧠
If I'm understanding correctly, the original implementation would have looked through all four of these nodes, no matter what.
In your new implementation, if Node 2 is removed, then only its successors (Nodes 3 and 4) will be visited, and only Node 3 will change to an in-degree of zero.
We can see that before it would visit all X nodes in the average case. But now the average case is just the typical number of successors (i.e., the mean number of out-degrees per node).
I would guess for most dbt projects, the mean number of out-degrees will be WAAAAY closer to 0 than it will be to X, hence "taking almost no time at all" 😉
@dbeatty10 Yes, your summarization is completely correct. @iknox-fa, since you've likely spent a lot more time with this code than me, can you let me know what you think about this change? If you think it looks good, I'll plan to merge soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@peterallenwebb I didn't see a changelog entry or an associated GitHub issue.
Is there a reason we wouldn't want to do both of those here?
No reason. Since I'm new to this part of the code, I just wanted a sanity check before I went to the trouble, but I'll create those tomorrow. |
Added changelog entry, and there is now an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
resolves #6759
Description
This optimization reduces by 25% the time needed to run dbt compile in a large scenario provided to me by @boxysean. In that scenario, dbt originally spent 25% of its time in the GraphQueue._find_new_additions() function, which has been modified in the PR so that it takes almost no time at all.
The optimization works because the the in-degree of a node in the GraphQueue's graph only changes when we remove a node, and the only nodes whose in-degree will change are those that were successors of the deleted node. Instead of looking at every node in the graph to see whether its in-degree has changed to zero, we now look only at successors of the deleted node.
Checklist
changie new
to create a changelog entry