Creating parquet table partitioned on a boolean column throws error #4545

Closed
georgecwan opened this issue Sep 25, 2023 · 2 comments · Fixed by #4547
Labels: bug, core, parquet, python, python-server-side

Comments

@georgecwan
Contributor

Description

Creating a parquet table partitioned on a column of boolean type and then running j_object.getColumnSources() will cause an exception.

Steps to reproduce

  1. Run the following code snippet:

         from deephaven import empty_table
         from deephaven.parquet import write, read

         part = empty_table(4).update("II=ii")

         # Write the same table into two directories named after the boolean partition values
         write(part, "/tmp/bool-test/boolCol=true/part.parquet")
         write(part, "/tmp/bool-test/boolCol=false/part.parquet")

         # Read the parent directory back as a single partitioned table
         bool_partition = read("/tmp/bool-test")

  2. Run bool_partition.j_object.getColumnSources()
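For readers unfamiliar with the directory layout used in the steps above, here is a rough sketch of how a key=value directory name such as boolCol=true can turn into a typed partition value. This is not Deephaven's reader: the class PartitionPathSketch and its methods are hypothetical names, and the tiny inference rule (boolean, then long, else string) is an assumption made only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: illustrates key=value partition directories only.
// It is NOT the Deephaven implementation.
final class PartitionPathSketch {

    // Collects key=value directory segments from a path, in order of appearance.
    static Map<String, Object> partitionValues(String path) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (String segment : path.split("/")) {
            int eq = segment.indexOf('=');
            if (eq <= 0) {
                continue; // not a key=value segment (e.g. "tmp" or "part.parquet")
            }
            String key = segment.substring(0, eq);
            String raw = segment.substring(eq + 1);
            result.put(key, inferValue(raw));
        }
        return result;
    }

    // Assumed, deliberately small inference rule: boolean, then long, else string.
    static Object inferValue(String raw) {
        if (raw.equalsIgnoreCase("true") || raw.equalsIgnoreCase("false")) {
            return Boolean.parseBoolean(raw); // autoboxed to java.lang.Boolean
        }
        try {
            return Long.parseLong(raw);
        } catch (NumberFormatException e) {
            return raw;
        }
    }

    public static void main(String[] args) {
        Map<String, Object> values =
                partitionValues("/tmp/bool-test/boolCol=true/part.parquet");
        Object value = values.get("boolCol");
        System.out.println(value + " has class " + value.getClass().getName());
    }
}
```

The point of the sketch is that an inferred partition value arrives as a boxed java.lang.Boolean object, so whatever column type is chosen for boolCol is derived from that boxed class.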
@georgecwan added the bug and triage labels on Sep 25, 2023
@rcaudy added the core, python, parquet, and python-server-side labels and removed the triage label on Sep 25, 2023
@rcaudy added this to the September 2023 milestone on Sep 25, 2023
@rcaudy
Member

rcaudy commented Sep 25, 2023

As a first step, we need to determine whether the issue is with the CSV-based type inference or with the Python code that adapts definitions.

@niloc132
Member

I think this is a Java bug: readPartitionTableInferSchema always unboxes boxed types to primitives instead of handling boolean specially:

    allColumns.add(ColumnDefinition.fromGenericType(partitionKey,
            getUnboxedTypeIfBoxed(partitionValue.getClass()), null, ColumnDefinition.ColumnType.Partitioning));

Contrast with ParquetMetadataFileLayout's adjustPartitionDefinition, which special cases boolean:

    // Primitive booleans should be boxed
    final Class<?> dataType = columnDefinition.getDataType();
    if (dataType == boolean.class) {
        return ColumnDefinition.fromGenericType(
                columnDefinition.getName(), Boolean.class, null, ColumnDefinition.ColumnType.Partitioning);
    }
    // Non-boolean primitives and boxed Booleans are supported as-is
    if (dataType.isPrimitive() || dataType == Boolean.class) {
        return columnDefinition.withPartitioning();
    }
    // Non-boolean boxed primitives should be unboxed
    final Class<?> unboxedType = TypeUtils.getUnboxedTypeIfBoxed(dataType);
    if (unboxedType != dataType) {
        return ColumnDefinition.fromGenericType(
                columnDefinition.getName(), unboxedType, null, ColumnDefinition.ColumnType.Partitioning);
    }
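
To make the contrast between the two snippets concrete, here is a self-contained sketch that works on plain Class objects instead of ColumnDefinition. Everything in it is a stand-in: PartitionTypeSketch, alwaysUnbox, and boxBooleanOnly are hypothetical names, alwaysUnbox assumes that getUnboxedTypeIfBoxed maps Boolean.class to boolean.class like any other boxed type, and returning other types unchanged is an assumption, since the quoted excerpt ends before its fall-through case.

```java
import java.util.Map;

// Hypothetical stand-in for the two code paths quoted above; not Deephaven code.
final class PartitionTypeSketch {

    private static final Map<Class<?>, Class<?>> BOXED_TO_PRIMITIVE = Map.of(
            Boolean.class, boolean.class,
            Byte.class, byte.class,
            Character.class, char.class,
            Short.class, short.class,
            Integer.class, int.class,
            Long.class, long.class,
            Float.class, float.class,
            Double.class, double.class);

    // Assumed behavior of the "always unbox" path: every boxed type,
    // Boolean included, becomes its primitive counterpart.
    static Class<?> alwaysUnbox(Class<?> type) {
        return BOXED_TO_PRIMITIVE.getOrDefault(type, type);
    }

    // Mirrors the quoted special case: booleans end up boxed,
    // other boxed primitives end up unboxed, everything else is unchanged.
    static Class<?> boxBooleanOnly(Class<?> type) {
        if (type == boolean.class || type == Boolean.class) {
            return Boolean.class;
        }
        return alwaysUnbox(type);
    }

    public static void main(String[] args) {
        Object partitionValue = Boolean.TRUE; // e.g. inferred from "boolCol=true"
        System.out.println("always unbox:     " + alwaysUnbox(partitionValue.getClass()));
        System.out.println("box boolean only: " + boxBooleanOnly(partitionValue.getClass()));
    }
}
```

Under these assumptions the first path hands a primitive boolean type to the partitioning column, while the second keeps it as Boolean, which is the difference the comment above points at.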
