Add Linked data benchmarks #451
Comments
I hadn't heard of this benchmark before, thanks for referencing it! It sounds really cool and useful. I believe graph processing will (mostly) need support for recursive CTEs, which I guess is quite a bit more work than plain CTEs (which currently just reference the query / logical plan) plus UNION ALL (which just returns all the partitions of the plans). Do you happen to have some reference material on recursive CTEs? I think it would be very valuable to plan and add support for graph processing 👍
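For readers following along, here is a minimal sketch of the kind of recursive CTE that graph queries rely on. It runs against SQLite through Python's standard `sqlite3` module, not against DataFusion, and the `knows` table and the `reachable` helper are made up for illustration.

```python
import sqlite3


def reachable(edges, start):
    """Return the sorted nodes reachable from `start` using a recursive CTE."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical edge table; the name `knows` is only an illustration.
    conn.execute("CREATE TABLE knows (src INTEGER, dst INTEGER)")
    conn.executemany("INSERT INTO knows VALUES (?, ?)", edges)
    rows = conn.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT ?                      -- anchor member: the start node
            UNION                         -- UNION deduplicates, so cycles terminate
            SELECT k.dst                  -- recursive member: one more hop
            FROM knows k JOIN reach r ON k.src = r.node
        )
        SELECT node FROM reach ORDER BY node
        """,
        (start,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    graph = [(1, 2), (2, 3), (3, 1), (4, 5)]
    print(reachable(graph, 1))
```

The recursive member refers to the CTE itself, which is the part a plain CTE plus UNION ALL does not cover: the engine has to re-run the join until no new rows appear.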
@Dandandan I'm not sure on the recursive |
For LSQB, here is the paper: https://szarnyasg.github.io/tsmb-grades21/ms.pdf and a presentation: https://docs.google.com/presentation/d/1pxyX_CWhFVYEttjTG2BrzuaMkEuLRxfhf5iX6n0leZI/mobilepresent?slide=id.gc6f9544c1_0_0
Thanks a lot again 👍 I think the challenging part of recursive CTEs in DataFusion will be doing them efficiently with Arrow data, as . Anti joins are another feature, but I think that should be relatively easy to add!
In this case LSQB sounds like a better first target. 👍
I have had bad experiences with dedicated "graph engines": usually a PostgreSQL- or SQL Server-based solution beats any dedicated solution out there, so I'm not worried about DataFusion's architecture not being fully exploited. Similarly, Differential Dataflow/Materialize or a naive Rust/C++ implementation that traverses the data is ridiculously faster, so there is a chance that Arrow's memory model and parallel joins will help. Still, I acknowledge that adding benchmarks that measure recursive CTEs might side-track the main DataFusion development. My gut feeling is that DataFusion would perform these queries relatively well, as they would work as "repeated high selectivity, high cardinality joins", and as far as I remember we are not particularly bad at that.
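To make the "repeated joins" view concrete, here is a rough sketch in plain Python of how a reachability query can be evaluated as a loop of join steps, each followed by an anti-join against the rows already found. This is only an illustration of the idea, not how DataFusion or any other engine implements it; the function name and the dict-based "hash join" are assumptions made for the example.

```python
def reachable_by_repeated_joins(edges, start):
    """Evaluate reachability as repeated join + anti-join steps.

    Returns the sorted reachable nodes and the number of join rounds.
    """
    # Index edges by source node: a stand-in for the build side of a hash join.
    by_src = {}
    for src, dst in edges:
        by_src.setdefault(src, []).append(dst)

    seen = {start}      # rows produced so far
    frontier = {start}  # rows added in the previous round
    rounds = 0
    while frontier:
        # Join step: frontier JOIN edges ON frontier.node = edges.src
        joined = {dst for node in frontier for dst in by_src.get(node, [])}
        # Anti-join step: drop rows that are already in the result
        frontier = joined - seen
        seen |= frontier
        rounds += 1
    return sorted(seen), rounds


if __name__ == "__main__":
    graph = [(1, 2), (2, 3), (3, 1), (4, 5)]
    print(reachable_by_repeated_joins(graph, 1))
```

Only the newly discovered rows are joined in each round, so the cost per round depends on the size of the frontier rather than on the whole result.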
Yeah, I believe joins are reasonably fast currently. I do need to do some comparisons (e.g. add the join queries to h2oai/db-benchmark#182). There are still some smaller tweaks that can be made, and more can be done at the planning level, such as:
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Recently I came across the LDBC benchmarks, which focus on graph-like workloads. I'm wondering whether DataFusion already covers the features the queries need. While I don't think it's as important as TPC-H, it would increase coverage and help identify performance regressions during DataFusion development. It would be an extra tool for getting a broader picture in a structured way (at least more structured than ad-hoc queries).
Describe the solution you'd like
Supporting the queries written for PostgreSQL: https://github.com/ldbc/ldbc_snb_bi/tree/main/postgres/queries .
Describe alternatives you've considered
Not implementing it. Optimizing DataFusion to perform well on this particular benchmark is out of scope as well. My assumption is that OLAP should be the first-class target and this a second-class one.
Additional context
While it's not an OLAP workload, I believe DataFusion would perform relatively well, or even extremely well.
Cc @Dandandan, IIRC you contributed the most in this area (CTE + UNION ALL).