Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regexp_match skips first match when returning match #3803

Closed
Jefffrey opened this issue Mar 5, 2023 · 5 comments · Fixed by #3807
Closed

regexp_match skips first match when returning match #3803

Jefffrey opened this issue Mar 5, 2023 · 5 comments · Fixed by #3807
Labels
arrow Changes to the arrow crate bug

Comments

@Jefffrey
Copy link
Contributor

Jefffrey commented Mar 5, 2023

Describe the bug

In some cases regexp_match will skip first and only match.

e.g. if pattern is foo and string to match is foo then should return single match foo. Currently returning empty array for the match (correctly finds there is a match, but doesn't return the match correctly).

To Reproduce

Example test in arrow-string/src/regexp.rs

    #[test]
    fn sandbox() {
        let array = StringArray::from(vec![Some("foo")]);
        let pattern = GenericStringArray::<i32>::from(vec![r"foo"]);
        let actual = regexp_match(&array, &pattern, None).unwrap();
        let result = actual.as_any().downcast_ref::<ListArray>().unwrap();
        let elem_builder: GenericStringBuilder<i32> = GenericStringBuilder::new();
        let mut expected_builder = ListBuilder::new(elem_builder);
        expected_builder.values().append_value("foo");
        expected_builder.append(true);
        let expected = expected_builder.finish();
        assert_eq!(&expected, result);
    }

Will panic with:

thread 'regexp::tests::sandbox' panicked at 'assertion failed: `(left == right)`
  left: `ListArray
[
  StringArray
[
  "foo",
],
]`,
 right: `ListArray
[
  StringArray
[
],
]`', arrow-string/src/regexp.rs:277:9

Can see the right (actual) has empty StringArray[] whereas expected contains the match: StringArray["foo"]

Expected behavior

Test should succeed.

Additional context

Seems its because by default skipping the first match in a capture group:

match re.captures(value) {
Some(caps) => {
for m in caps.iter().skip(1).flatten() {
list_builder.values().append_value(m.as_str());
}
list_builder.append(true);
}
None => list_builder.append(false),
}

Where in the test example above, caps has value:

[arrow-string/src/regexp.rs:212] &caps = Captures(
    {
        0: Some(
            "foo",
        ),
    },
)

Relevant regex doc: https://docs.rs/regex/latest/regex/struct.Regex.html#method.captures

Specifically:

Capture group 0 always corresponds to the entire match.

Original issue: apache/datafusion#5479

@Jefffrey Jefffrey added the bug label Mar 5, 2023
@Jefffrey Jefffrey changed the title regexp_match skips first match regexp_match skips first match when returning match Mar 5, 2023
@tustvold
Copy link
Contributor

tustvold commented Mar 6, 2023

I'm not sure if I'm missing something but foo is not a capture group? If you instead change the pattern to contain a capture group, e.g. (foo) it works correctly?

@Jefffrey
Copy link
Contributor Author

Jefffrey commented Mar 6, 2023

If parity with Postgres is desired, then this would be considered a bug. Relevant extract:

If a match is found, and the pattern contains no parenthesized subexpressions, then the result is a single-element text array containing the substring matching the whole pattern

From: https://www.postgresql.org/docs/current/functions-matching.html

Also it might be somewhat confusing as returning a not-null value in the output ListArray indicates a match was found (else it would be null instead of a StringArray), but resultant StringArray itself is empty without the match. The behaviour seems somewhat inconsistent?

@Weijun-H
Copy link
Member

Weijun-H commented Mar 6, 2023

I'm not sure if I'm missing something but foo is not a capture group? If you instead change the pattern to contain a capture group, e.g. (foo) it works correctly?

(foo) can work.

@tustvold
Copy link
Contributor

tustvold commented Mar 7, 2023

If parity with Postgres is desired, then this would be considered a bug

Makes sense, we should probably update the function's docs to match

@tustvold tustvold added the arrow Changes to the arrow crate label Mar 10, 2023
@tustvold
Copy link
Contributor

label_issue.py automatically added labels {'arrow'} from #3807

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants