Add array_dot_product / list_dot_product function #12476

Closed
austin362667 wants to merge 4 commits

@austin362667 austin362667 commented Sep 15, 2024

Which issue does this PR close?

Closes #12475.

Rationale for this change

Add dot product functionality to DataFusion. A scalar UDF array_dot_product / list_dot_product that computes the inner product of two arrays would be valuable; well-known databases such as DuckDB already support it.

What changes are included in this PR?

  • Move convert_to_f64_array into functions-nested/utils.rs.
  • Add array_dot_product / list_dot_product in functions-nested.
  • Add sqllogictest (SLT) cases in array.slt.
  • Update the corresponding scalar UDF docs.

Are these changes tested?

Yes. Added array-specific SQL logic tests covering List, LargeList, and FixedSizeList.

Are there any user-facing changes?

Yes, a new function array_dot_product(arr1, arr2) is added.

For instance,

> CREATE TABLE word_embedding (
    emb_a DOUBLE[],
    emb_b DOUBLE[]
);
0 row(s) fetched.
Elapsed 0.008 seconds.

> INSERT INTO word_embedding VALUES
([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]),
([2.0, 4.0, 6.0], [2.0, 4.0, 6.0]),
([1.5, 2.5, 3.5], [4.5, 6.5, 8.5]);
+-------+
| count |
+-------+
| 3     |
+-------+
1 row(s) fetched.
Elapsed 0.009 seconds.

> SELECT
    emb_a,
    emb_b,
    list_dot_product(emb_a, emb_b) AS inner_product
FROM
    word_embedding;
+-----------------+-----------------+---------------+
| emb_a           | emb_b           | inner_product |
+-----------------+-----------------+---------------+
| [1.0, 2.0, 3.0] | [1.0, 2.0, 5.0] | 20.0          |
| [2.0, 4.0, 6.0] | [2.0, 4.0, 6.0] | 56.0          |
| [1.5, 2.5, 3.5] | [4.5, 6.5, 8.5] | 52.75         |
+-----------------+-----------------+---------------+
3 row(s) fetched.
Elapsed 0.008 seconds.
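
The arithmetic behind the inner_product column can be sketched in plain Rust. This is a minimal illustration, not the code in this PR: the helper name dot_product is hypothetical, it works on f64 slices rather than Arrow arrays, and returning None for mismatched lengths is an assumption made here (the PR description does not say how that case or NULL elements are handled).

```rust
/// Hypothetical helper for illustration only; NOT the PR's implementation.
/// Computes the inner product of two equal-length slices.
/// Assumption: mismatched lengths yield `None`.
fn dot_product(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    // Multiply element-wise and sum the products.
    Some(a.iter().zip(b.iter()).map(|(x, y)| x * y).sum())
}
```

For the first row above, dot_product(&[1.0, 2.0, 3.0], &[1.0, 2.0, 5.0]) evaluates 1.0*1.0 + 2.0*2.0 + 3.0*5.0 and yields Some(20.0), matching the query output.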

Signed-off-by: Austin Liu <austin362667@gmail.com>
@github-actions bot added the documentation (Improvements or additions to documentation) and sqllogictest (SQL Logic Tests (.slt)) labels on Sep 15, 2024
@dharanad
Contributor

Hey @austin362667, maybe you should take a look at discussion #12357.

@austin362667
Contributor Author

Thank you, @dharanad, for bringing this to my attention. This is a great discussion. I like the idea of keeping the DataFusion core as simple as possible while retaining useful DuckDB functions that enhance the user experience. I'm open to any feedback~

@alamb
Contributor

alamb commented Sep 16, 2024

Thank you, @dharanad, for bringing this to my attention. This is a great discussion. I like the idea of keeping the DataFusion core as simple as possible while retaining useful DuckDB functions that enhance the user experience. I'm open to any feedback~

What would you think about creating a new crate in https://github.com/datafusion-contrib to hold additional DuckDB functions? Perhaps https://github.com/datafusion-contrib/datafusion-functions-duckdb, similar to https://github.com/datafusion-contrib/datafusion-functions-json for JSON from @samuelcolvin and co.

It would be a pretty neat way to help build out the function library in DataFusion.

Also, @matthewmturner and I have been working on an integration UI similar to DuckDB's, with many features -- https://github.com/datafusion-contrib/datafusion-dft -- we could then integrate these functions into dft so they are easy to use.

@austin362667
Contributor Author

Sure, thank you Andrew for proposing this initiative.
I like the idea. Let's do it this way!!

@alamb
Contributor

alamb commented Sep 17, 2024

Sure, thank you Andrew for proposing this initiative. I like the idea. Let's do it this way!!

Awesome -- let's try a new repo. Follow-on discussion here: #12254 (comment)

@alamb closed this on Sep 17, 2024