Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support sort for Decimal data type #1487

Merged
merged 1 commit into from
Mar 28, 2022
Merged

Support sort for Decimal data type #1487

merged 1 commit into from
Mar 28, 2022

Conversation

yjshen
Copy link
Member

@yjshen yjshen commented Mar 25, 2022

Which issue does this PR close?

Closes #1137 .

Rationale for this change

We already have sort_decimal but fail to sort by decimal columns in DataFusion.

What changes are included in this PR?

Support decimal type for build_compare.

Are there any user-facing changes?

No.

Copy link
Contributor

@liukun4515 liukun4515 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@liukun4515
Copy link
Contributor

and link the pr to the issue #1137

@liukun4515
Copy link
Contributor

liukun4515 commented Mar 26, 2022

@yjshen Can you provide sql in the datafusion side?
I find the sorting of decimal works on my side.

❯ \d food
+---------------+--------------+------------+-------------+-----------------+-------------+
| table_catalog | table_schema | table_name | column_name | data_type       | is_nullable |
+---------------+--------------+------------+-------------+-----------------+-------------+
| datafusion    | public       | food       | a           | Decimal(10, 5)  | NO          |
| datafusion    | public       | food       | b           | Decimal(20, 15) | NO          |
| datafusion    | public       | food       | c           | Boolean         | NO          |
+---------------+--------------+------------+-------------+-----------------+-------------+

❯ select * from food;
+---------+-------------------+-------+
| a       | b                 | c     |
+---------+-------------------+-------+
| 0.00001 | 0.000000000001000 | true  |
| 0.00002 | 0.000000000002000 | false |
| 0.00002 | 0.000000000002000 | false |
| 0.00003 | 0.000000000003000 | true  |
| 0.00003 | 0.000000000003000 | true  |
| 0.00003 | 0.000000000003000 | true  |
| 0.00004 | 0.000000000004000 | false |
| 0.00004 | 0.000000000004000 | false |
| 0.00004 | 0.000000000004000 | false |
| 0.00004 | 0.000000000004000 | false |
| 0.00005 | 0.000000000005000 | true  |
| 0.00005 | 0.000000000005000 | true  |
| 0.00005 | 0.000000000005000 | true  |
| 0.00005 | 0.000000000005000 | true  |
| 0.00005 | 0.000000000005000 | true  |
+---------+-------------------+-------+


sort data by desc.

❯ select * from food order by a desc;
+---------+-------------------+-------+
| a       | b                 | c     |
+---------+-------------------+-------+
| 0.00005 | 0.000000000005000 | true  |
| 0.00005 | 0.000000000005000 | true  |
| 0.00005 | 0.000000000005000 | true  |
| 0.00005 | 0.000000000005000 | true  |
| 0.00005 | 0.000000000005000 | true  |
| 0.00004 | 0.000000000004000 | false |
| 0.00004 | 0.000000000004000 | false |
| 0.00004 | 0.000000000004000 | false |
| 0.00004 | 0.000000000004000 | false |
| 0.00003 | 0.000000000003000 | true  |
| 0.00003 | 0.000000000003000 | true  |
| 0.00003 | 0.000000000003000 | true  |
| 0.00002 | 0.000000000002000 | false |
| 0.00002 | 0.000000000002000 | false |
| 0.00001 | 0.000000000001000 | true  |
+---------+-------------------+-------+

@liukun4515
Copy link
Contributor

The test data is from aggregate_simple.csv in the datafusion.

@yjshen
Copy link
Member Author

yjshen commented Mar 26, 2022

It's been used in two places in DataFusion:
InMemorySorter when we have combined all buffered batches into one single batch and call arrow::compute::lexsort_to_indices with more than one sort column. https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/sorts/sort.rs#L542

SortPreservingMerge while we are combing several partial orders into one final order, and during constructing SortKeyCursor https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/sorts/mod.rs#L188

@yjshen
Copy link
Member Author

yjshen commented Mar 26, 2022

#[tokio::test]
async fn csv_query_with_decimal_by_sql() -> Result<()> {
    let mut ctx = SessionContext::new();
    register_simple_aggregate_csv_with_decimal_by_sql(&mut ctx).await;
    let sql = "SELECT * from aggregate_simple order by c1, c2";
    let actual = execute_to_batches(&ctx, sql).await;
...
}
thread 'sql::select::csv_query_with_decimal_by_sql' panicked at 'Executing physical plan for 'SELECT * from aggregate_simple order by c1, c2': SortExec
{ input: CoalescePartitionsExec { input: ProjectionExec { expr: [(Column { name: "c1", index: 0 }, "c1"), (Column { name:  ............ sicalSortExpr { expr: Column { name: "c2", index: 1 }, options: SortOptions { descending: false, nulls_first: false } }], metrics_set: CompositeMetricsSet 
{ mid: ExecutionPlanMetricsSet { inner: Mutex { data: MetricsSet { metrics: [] } } }, final_: ExecutionPlanMetricsSet { inner: Mutex { data: MetricsSet { metrics: [] } } } }, preserve_partitioning: false }: 
ArrowError(InvalidArgumentError("The data type type Decimal(10, 6) has no natural order"))', datafusion/tests/sql/mod.rs:547:49

The error message is too long, so I cut the inner info

(Decimal(_, _), Decimal(_, _)) => {
let left: DecimalArray = DecimalArray::from(left.data().clone());
let right: DecimalArray = DecimalArray::from(right.data().clone());
Box::new(move |i, j| left.value(i).cmp(&right.value(j)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the precision and scale of left and right are not the same? In that case I don't think calling cmp on the value is correct

If precision and scale are always the same, maybe we can add an assert_eq! to make that clear

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's validated in the beginning of this function

pub fn build_compare(left: &dyn Array, right: &dyn Array) -> Result<DynComparator> {
    use DataType::*;
    use IntervalUnit::*;
    use TimeUnit::*;
    Ok(match (left.data_type(), right.data_type()) {
        (a, b) if a != b => {
            return Err(ArrowError::InvalidArgumentError(
                "Can't compare arrays of different types".to_string(),
            ));
        }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I assert it again here?

@alamb alamb merged commit 00accc7 into apache:master Mar 28, 2022
@alamb alamb changed the title Support sort for decimal data type Support sort for Decimal data type Mar 31, 2022
@alamb alamb added the arrow Changes to the arrow crate label Mar 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support DecimalArray in sort kernel
3 participants