Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple Functions Preview #14668

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

findepi
Copy link
Member

@findepi findepi commented Feb 14, 2025

This is currently a preview PR for #12635 to share status of the work.

Based on add_one function, this creates add_one_udf (by convention, configurable):

#[excalibur_function]
fn add_one(a: u64) -> u64 {
    a + 1
}

The generated ScalarUDFImpl is fully functional, with signature derived from the Rust-level signature of the decorated function:

#[test]
fn simple_function_signature() {
    let udf = add_one_udf();
    assert_eq!(udf.name(), "add_one");

    assert_eq!(
        udf.signature(),
        &Signature::coercible(
            vec![TypeSignatureClass::Native(Arc::new(NativeType::UInt64))],
            Volatility::Immutable
        )
    );
    let return_type = udf.return_type(&[DataType::UInt64]).unwrap();
    assert_eq!(return_type, DataType::UInt64);
}

#[test]
fn simple_function_invoke_array() {
    let udf = add_one_udf();

    let invoke_args = vec![ColumnarValue::Array(Arc::new(UInt64Array::from(vec![
        1000, 2000, 3000, 4000, 5000,
    ])))];
    let ColumnarValue::Array(result_array) = udf
        .invoke_with_args(ScalarFunctionArgs {
            args: invoke_args,
            number_rows: 5,
            return_type: &DataType::UInt64,
        })
        .unwrap()
    else {
        panic!("Expected array result");
    };

    assert_eq!(
        &*result_array,
        &UInt64Array::from(vec![1001, 2001, 3001, 4001, 5001])
    );
}

#[test]
fn simple_function_invoke_scalar() {
    let udf = add_one_udf();

    let invoke_args = vec![ColumnarValue::Scalar(ScalarValue::UInt64(Some(1000)))];
    let ColumnarValue::Array(result_array) = udf
        .invoke_with_args(ScalarFunctionArgs {
            args: invoke_args,
            number_rows: 1,
            return_type: &DataType::UInt64,
        })
        .unwrap()
    else {
        panic!("Expected array result");
    };

    assert_eq!(&*result_array, &UInt64Array::from(vec![1001]));
}

@findepi findepi marked this pull request as draft February 14, 2025 15:40
@findepi findepi mentioned this pull request Feb 14, 2025
@findepi
Copy link
Member Author

findepi commented Feb 14, 2025

i now also have support for

various numeric

#[excalibur_function]
fn add(a: i32, b: u32) -> i64 {
    a as i64 + b as i64
}

nullable function arguments

#[excalibur_function]
fn first_non_null(a: Option<i32>, b: Option<i32>) -> Option<i32> {
    a.or(b)
}

nullable results

#[excalibur_function]
fn try_div(a: i32, b: i32) -> Option<i32> {
    if b == 0 {
        None
    } else {
        Some(a / b)
    }
}

functions that can throw

#[excalibur_function]
fn maybe_fail(fail: bool) -> Result<bool> {
    if fail {
        exec_err!("This test function just failed")
    } else {
        Ok(true)
    }
}

and string arguments

#[excalibur_function]
fn character_length(s: &str) -> u64 {
    s.chars().count() as u64
}

will update this PR soon (updated)

@findepi findepi force-pushed the findepi/simple-functions-pr branch from 55bbf7a to d66b8c6 Compare February 14, 2025 21:54
@github-actions github-actions bot added the development-process Related to development process of DataFusion label Feb 14, 2025
@findepi findepi force-pushed the findepi/simple-functions-pr branch from d66b8c6 to 55a02e6 Compare February 14, 2025 21:57
@github-actions github-actions bot removed the development-process Related to development process of DataFusion label Feb 14, 2025
@jayzhan211
Copy link
Contributor

#[excalibur_function]
fn add(a: i32, b: u32) -> i64 {
    a as i64 + b as i64
}

Is it possible to define a function using generics? Otherwise, will we need to duplicate the function for different types?

@findepi
Copy link
Member Author

findepi commented Feb 16, 2025

Is it possible to define a function using generics?

currently not.

Otherwise, will we need to duplicate the function for different types?

currently yes; which is easy with a macro.
of course it would be nicer to support such functions directly. I don't know yet how to do this.

@findepi
Copy link
Member Author

findepi commented Feb 18, 2025

@findepi findepi force-pushed the findepi/simple-functions-pr branch 2 times, most recently from 39af19b to 73efaf3 Compare February 18, 2025 11:37
@findepi findepi force-pushed the findepi/simple-functions-pr branch from 73efaf3 to a277aa0 Compare February 18, 2025 16:52
@findepi
Copy link
Member Author

findepi commented Feb 18, 2025

Added support for out types:

#[excalibur_function]
fn concat(a: &str, b: &str, out: &mut impl std::fmt::Write) -> Result<ValuePresence> {
    if a.is_empty() && b.is_empty() {
        // Be like Oracle
        return Ok(ValuePresence::Null);
    }
    out.write_str(a)?;
    out.write_str(b)?;
    Ok(ValuePresence::Value)
}

@findepi findepi marked this pull request as ready for review February 20, 2025 07:23
@findepi
Copy link
Member Author

findepi commented Feb 20, 2025

Marked this as non-draft. It would be great to have reviews here.

@findepi
Copy link
Member Author

findepi commented Feb 20, 2025

I bencharked

#[excalibur_function]
fn character_length(s: &str) -> i32 {
    s.chars().count() as i32
}

comparing it with the character_length implementation we ship today

#[derive(Debug)]
pub struct CharacterLengthFunc {
signature: Signature,
aliases: Vec<String>,
}
impl Default for CharacterLengthFunc {
fn default() -> Self {
Self::new()
}
}
impl CharacterLengthFunc {
pub fn new() -> Self {
use DataType::*;
Self {
signature: Signature::uniform(
1,
vec![Utf8, LargeUtf8, Utf8View],
Volatility::Immutable,
),
aliases: vec![String::from("length"), String::from("char_length")],
}
}
}
impl ScalarUDFImpl for CharacterLengthFunc {
fn as_any(&self) -> &dyn Any {
self
}
fn name(&self) -> &str {
"character_length"
}
fn signature(&self) -> &Signature {
&self.signature
}
fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
utf8_to_int_type(&arg_types[0], "character_length")
}
fn invoke_batch(
&self,
args: &[ColumnarValue],
_number_rows: usize,
) -> Result<ColumnarValue> {
make_scalar_function(character_length, vec![])(args)
}
fn aliases(&self) -> &[String] {
&self.aliases
}
fn documentation(&self) -> Option<&Documentation> {
self.doc()
}
}
/// Returns number of characters in the string.
/// character_length('josé') = 4
/// The implementation counts UTF-8 code points to count the number of characters
fn character_length(args: &[ArrayRef]) -> Result<ArrayRef> {
match args[0].data_type() {
DataType::Utf8 => {
let string_array = args[0].as_string::<i32>();
character_length_general::<Int32Type, _>(string_array)
}
DataType::LargeUtf8 => {
let string_array = args[0].as_string::<i64>();
character_length_general::<Int64Type, _>(string_array)
}
DataType::Utf8View => {
let string_array = args[0].as_string_view();
character_length_general::<Int32Type, _>(string_array)
}
_ => unreachable!("CharacterLengthFunc"),
}
}
fn character_length_general<'a, T, V>(array: V) -> Result<ArrayRef>
where
T: ArrowPrimitiveType,
T::Native: OffsetSizeTrait,
V: StringArrayType<'a>,
{
let mut builder = PrimitiveBuilder::<T>::with_capacity(array.len());
// String characters are variable length encoded in UTF-8, counting the
// number of chars requires expensive decoding, however checking if the
// string is ASCII only is relatively cheap.
// If strings are ASCII only, count bytes instead.
let is_array_ascii_only = array.is_ascii();
if array.null_count() == 0 {
if is_array_ascii_only {
for i in 0..array.len() {
let value = array.value(i);
builder.append_value(T::Native::usize_as(value.len()));
}
} else {
for i in 0..array.len() {
let value = array.value(i);
builder.append_value(T::Native::usize_as(value.chars().count()));
}
}
} else if is_array_ascii_only {
for i in 0..array.len() {
if array.is_null(i) {
builder.append_null();
} else {
let value = array.value(i);
builder.append_value(T::Native::usize_as(value.len()));
}
}
} else {
for i in 0..array.len() {
if array.is_null(i) {
builder.append_null();
} else {
let value = array.value(i);
builder.append_value(T::Native::usize_as(value.chars().count()));
}
}
}
Ok(Arc::new(builder.finish()) as ArrayRef)
}

I don't have expectations that generic implementation will be faster than 50x1 longer, hand-written and optimized code.
Rather, I wanted to check what can be a performance difference between simple 4-liner implementation and one that has dedicated code paths for ascii/unicode, nulls/non-nulls and Utf8/Utf8View combinations. I.e. anything but not "simple".

The standard implementation is a bit faster at short strings image
The generic implementation is a bit faster at long ASCII strings image

Footnotes

  1. including unit tests code, which is natural necessity in a non-trivial implementation, but excluding docs

@jayzhan211
Copy link
Contributor

I don't have expectations that generic implementation will be faster than 50x1 longer, hand-written and optimized code.
Rather, I wanted to check what can be a performance difference between simple 4-liner implementation and one that has dedicated code paths for ascii/unicode, nulls/non-nulls and Utf8/Utf8View combinations. I.e. anything but not "simple".

Interesting, I will review it when I have time

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants