Add Linked data benchmarks #451

alippai · 2021-05-30T21:50:08Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Recently I came across the LDBC benchmarks, which focus on graph-like workloads. I'm wondering whether DataFusion already covers the features the queries need. While I don't think it's as important as TPC-H, it would increase coverage and help identify performance regressions during DataFusion development. It would be an extra tool for getting a broader picture in a structured way (at least more structured than ad-hoc queries).

Describe the solution you'd like
Supporting the queries written for PostgreSQL: https://github.com/ldbc/ldbc_snb_bi/tree/main/postgres/queries .

Describe alternatives you've considered
Not implementing it. Optimizing DataFusion to perform well on this particular benchmark is out of scope as well. My assumption is that OLAP should be first-class and this should be a second-class target.

Additional context
While it's not an OLAP workload, I believe DataFusion would perform relatively well, perhaps even extremely well.

Cc @Dandandan: IIRC you contributed the most (CTE + UNION ALL) in this area.

Dandandan · 2021-05-31T18:49:27Z

I hadn't heard of this benchmark before, thanks for referencing it! Sounds really cool/useful.

I believe for graph processing you'll need (mostly) support for recursive CTEs, which I guess is quite a bit more work than CTEs themselves (which currently just reference the query / logical plan) plus UNION ALL (which just returns all the partitions of the plans).

Do you happen to have some reference material on recursive CTEs?

I think it would be very valuable to plan / add support for graph processing 👍

alippai · 2021-05-31T19:28:09Z

@Dandandan I'm not sure about the recursive CTE implementation part, but PostgreSQL has a brief description of the algorithm: https://www.postgresql.org/docs/current/queries-with.html#QUERIES-WITH-SELECT . You are right that some queries need pretty complex features like subqueries, window functions, date handling, and generate_subscripts. There is also a more lightweight version of the benchmark: https://github.com/ldbc/lsqb/ . It focuses on various joins (join, antijoin, outer join) instead of the "recursive" workload.
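For readers who don't want to follow the link, here is a minimal Python sketch of the iteration the PostgreSQL documentation describes for `WITH RECURSIVE`: evaluate the non-recursive term, then keep feeding the previous step's rows (the "working table") into the recursive term until it produces nothing new. The function names and the `EDGES` data are hypothetical and purely illustrative; this is not how DataFusion or PostgreSQL implement it internally.

```python
from typing import Callable, Hashable, Iterable, List

Row = Hashable  # rows are plain tuples in this sketch


def evaluate_recursive_cte(
    non_recursive_term: Iterable[Row],
    recursive_term: Callable[[List[Row]], Iterable[Row]],
    distinct: bool = True,
) -> List[Row]:
    """Working-table iteration; distinct=True models UNION, False models UNION ALL."""
    seen = set()

    def keep(rows: Iterable[Row]) -> List[Row]:
        # UNION ALL keeps everything; UNION drops rows already in the result.
        if not distinct:
            return list(rows)
        fresh = []
        for row in rows:
            if row not in seen:
                seen.add(row)
                fresh.append(row)
        return fresh

    working = keep(non_recursive_term)
    result = list(working)
    while working:
        # The recursive term only sees the rows produced by the previous step.
        intermediate = keep(recursive_term(working))
        result.extend(intermediate)
        working = intermediate
    return result


# Hypothetical edge list (src, dst); it contains the cycle 1 -> 2 -> 3 -> 1.
EDGES = [(1, 2), (2, 3), (3, 1), (3, 4)]


def reachable_from(start: int) -> List[Row]:
    def step(working: List[Row]) -> List[Row]:
        frontier = set(working)
        # Equivalent to joining the working table with the edge table on src.
        return [(dst,) for src, dst in EDGES if (src,) in frontier]

    return evaluate_recursive_cte([(start,)], step, distinct=True)
```

Note that with `distinct=False` (UNION ALL) the cyclic example above would never terminate, which is why reachability queries typically use UNION or carry an explicit depth limit.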

alippai · 2021-05-31T19:31:40Z

For the LSQB here is the paper https://szarnyasg.github.io/tsmb-grades21/ms.pdf and a presentation https://docs.google.com/presentation/d/1pxyX_CWhFVYEttjTG2BrzuaMkEuLRxfhf5iX6n0leZI/mobilepresent?slide=id.gc6f9544c1_0_0

Dandandan · 2021-05-31T20:08:44Z

Thanks a lot again 👍

I think the challenging part with recursive CTE in DataFusion will be doing it efficiently with arrow data, as .
So also into what vectorized engines (can) do here.
It might not always be possible to do so; in that case we should have something that efficiently does row-by-row processing.

Anti joins are another feature, but I think that should be relatively easy to add!
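To make the term concrete, here is a minimal Python sketch of a hash-based anti join: build a set of keys from one side and keep only the rows of the other side whose key is absent. The function name, the key extractors, and the sample tables are hypothetical, and NULL semantics (the difference between `NOT EXISTS` and `NOT IN`) are ignored; this is only an illustration, not DataFusion's implementation.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

Row = Tuple


def hash_anti_join(
    left: Sequence[Row],
    right: Sequence[Row],
    left_key: Callable[[Row], Hashable],
    right_key: Callable[[Row], Hashable],
) -> List[Row]:
    """Return the rows of `left` that have no match in `right`."""
    # Build phase: only the keys of the right side are needed, not the rows.
    build = {right_key(row) for row in right}
    # Probe phase: keep a left row only when its key is missing from the set.
    return [row for row in left if left_key(row) not in build]


# Hypothetical tables: persons(id, name) and knows(person_id, friend_id).
persons = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
knows = [(1, 2), (2, 1)]

# Persons without any outgoing "knows" edge.
loners = hash_anti_join(persons, knows, lambda p: p[0], lambda k: k[0])
```

The structure is the same as a regular hash join with the match condition inverted, which is one reason it tends to be a small addition once a hash join exists.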

alippai · 2021-05-31T20:28:58Z

In this case LSQB sounds like a better first target. 👍

> So also into what vectorized engines (can) do here.

I have had bad experiences with dedicated "graph engines": usually a PostgreSQL- or SQL Server-based solution beats any dedicated solution out there, so I wouldn't be afraid that DataFusion's architecture is not fully exploited. Similarly, Differential Dataflow/Materialize or a naive Rust/C++ implementation traversing the data is ridiculously faster, so there is a chance that Arrow's memory model and parallel joins help. Still, I acknowledge that adding benchmarks measuring recursive CTEs might side-track the main DataFusion development. My gut feeling is that DataFusion would run these queries relatively well, as they would work as "repeated high selectivity, high cardinality joins", and as far as I remember we are not particularly bad at that.

Dandandan · 2021-05-31T20:57:38Z

Yeah, I believe joins are reasonably fast currently. I do need to do some comparisons (e.g. add the join queries to h2oai/db-benchmark#182).

There are still some smaller tweaks that can be made, and more can be done at the planning level, such as:

Implement a better hash join reordering algorithm
Improve planning based on the size of tables / expected number of rows
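
As a rough illustration of the second item, here is a minimal Python sketch of a hash join that picks its build side from the input sizes: the hash table is built on the smaller input and probed with the larger one. It uses exact `len()` values where a real planner would use estimated row counts, and the function and parameter names are hypothetical; this is not DataFusion's join implementation or its reordering logic.

```python
from collections import defaultdict
from typing import Callable, Hashable, List, Sequence, Tuple

Row = Tuple


def hash_join(
    left: Sequence[Row],
    right: Sequence[Row],
    left_key: Callable[[Row], Hashable],
    right_key: Callable[[Row], Hashable],
) -> List[Tuple[Row, Row]]:
    """Inner equi-join that builds the hash table on the smaller input."""
    build_is_left = len(left) <= len(right)
    if build_is_left:
        build, build_key, probe, probe_key = left, left_key, right, right_key
    else:
        build, build_key, probe, probe_key = right, right_key, left, left_key

    # Build phase: group the smaller side's rows by join key.
    table = defaultdict(list)
    for row in build:
        table[build_key(row)].append(row)

    # Probe phase: stream the larger side; output is always (left_row, right_row).
    out = []
    for row in probe:
        for match in table.get(probe_key(row), []):
            out.append((match, row) if build_is_left else (row, match))
    return out
```

The output is the same whichever side is built; only the size of the hash table changes, which is where table-size statistics pay off.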

alippai added the enhancement New feature or request label May 30, 2021
