feat(connect): `printSchema` #3617

andrewgazelka · 2024-12-19T10:53:12Z

TODO

should we reuse TreeDisplay?
remove unwraps

pub trait TreeDisplay {

should we make our own?
Example of our own impl, which would still need to be tested (don't take it too seriously!)

use std::fmt::Write; // brings `writeln!` on `String` into scope

pub fn to_tree_string(schema: &Schema) -> eyre::Result<String> {
    let mut output = String::new();
    // Start with root
    writeln!(&mut output, "root")?;
    // Now print each top-level field
    for (name, field) in &schema.fields {
        print_field(&mut output, name, &field.dtype, /*nullable*/ true, 1)?;
    }
    Ok(output)
}

// A helper function to print a field at a given level of indentation.
// level=1 means a single " |-- " prefix, level=2 means
// " |    |-- " and so on, mimicking Spark's indentation style.
fn print_field(
    w: &mut String, 
    field_name: &str, 
    dtype: &DataType, 
    nullable: bool, 
    level: usize
) -> eyre::Result<()> {
    // Construct the prefix for indentation.
    // Spark indentation levels:
    // level 1:  " |-- "
    // level 2:  " |    |-- "
    // level n:  " |" followed by (n-1) copies of "    |", then "-- "
    // (level is always >= 1 here; saturating_sub only guards against underflow)
    let indent = format!(" |{}-- ", "    |".repeat(level.saturating_sub(1)));

    // Get a user-friendly string for dtype
    let dtype_str = type_to_string(dtype);

    writeln!(
        w,
        "{}{}: {} (nullable = {})",
        indent, field_name, dtype_str, nullable
    )?;

    // If the dtype is a struct, we must print its child fields with increased indentation.
    if let DataType::Struct(fields) = dtype {
        for field in fields {
            print_field(w, &field.name, &field.dtype, true, level + 1)?;
        }
    }

    Ok(())
}

fn type_to_string(dtype: &DataType) -> String {
    // We want a nice, human-readable type string.
    // Spark generally prints something like "integer", "string", etc.
    // We'll follow a similar style here:
    match dtype {
        DataType::Null => "null".to_string(),
        DataType::Boolean => "boolean".to_string(),
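        // Spark prints a distinct name per signed integer width.
        DataType::Int8 => "byte".to_string(),
        DataType::Int16 => "short".to_string(),
        DataType::Int32 => "integer".to_string(),
        DataType::Int64 => "long".to_string(),
        // Spark has no unsigned integer types, so this label is only a placeholder.
        DataType::UInt8
        | DataType::UInt16
        | DataType::UInt32
        | DataType::UInt64 => "integer".to_string(),
        DataType::Float32 => "float".to_string(),
        DataType::Float64 => "double".to_string(),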
        DataType::Decimal128(_, _) => "decimal".to_string(),
        DataType::Timestamp(_, _) => "timestamp".to_string(),
        DataType::Date => "date".to_string(),
        DataType::Time(_) => "time".to_string(),
        DataType::Duration(_) => "duration".to_string(),
        DataType::Interval => "interval".to_string(),
        DataType::Binary => "binary".to_string(),
        DataType::FixedSizeBinary(_) => "fixed_size_binary".to_string(),
        DataType::Utf8 => "string".to_string(),
        DataType::FixedSizeList(_, _) => "array".to_string(), // Spark calls them arrays
        DataType::List(_) => "array".to_string(),
        DataType::Struct(_) => "struct".to_string(),
        DataType::Map { .. } => "map".to_string(),
        DataType::Extension(_, _, _) => "extension".to_string(),
        DataType::Embedding(_, _) => "embedding".to_string(),
        DataType::Image(_) => "image".to_string(),
        DataType::FixedShapeImage(_, _, _) => "fixed_shape_image".to_string(),
        DataType::Tensor(_) => "tensor".to_string(),
        DataType::FixedShapeTensor(_, _) => "fixed_shape_tensor".to_string(),
        DataType::SparseTensor(_) => "sparse_tensor".to_string(),
        DataType::FixedShapeSparseTensor(_, _) => "fixed_shape_sparse_tensor".to_string(),
        #[cfg(feature = "python")]
        DataType::Python => "python_object".to_string(),
        DataType::Unknown => "unknown".to_string(),
    }
}
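
Since the draft above "would need to be tested", here is a minimal, self-contained sketch of the same Spark-style indentation that can be checked in isolation. `ToyType`, `prefix`, `write_field`, and `tree_string` are hypothetical names used only for illustration; they are not Daft's real `Schema`/`DataType` API, and every field is assumed nullable.

```rust
use std::fmt::Write;

// Hypothetical stand-in for a schema type; NOT Daft's real `DataType`.
#[allow(dead_code)]
enum ToyType {
    Long,
    Str,
    Struct(Vec<(String, ToyType)>),
}

// Spark-style prefix: level 1 is " |-- "; each deeper level adds "    |".
fn prefix(level: usize) -> String {
    format!(" |{}-- ", "    |".repeat(level.saturating_sub(1)))
}

// Writes one field, then recurses into struct children one level deeper.
fn write_field(out: &mut String, name: &str, ty: &ToyType, level: usize) -> std::fmt::Result {
    let label = match ty {
        ToyType::Long => "long",
        ToyType::Str => "string",
        ToyType::Struct(_) => "struct",
    };
    // Assumption: every field is reported as nullable.
    writeln!(out, "{}{}: {} (nullable = true)", prefix(level), name, label)?;
    if let ToyType::Struct(children) = ty {
        for (child_name, child_ty) in children {
            write_field(out, child_name, child_ty, level + 1)?;
        }
    }
    Ok(())
}

// Renders the whole tree, starting with the "root" line.
fn tree_string(fields: &[(String, ToyType)]) -> Result<String, std::fmt::Error> {
    let mut out = String::from("root\n");
    for (name, ty) in fields {
        write_field(&mut out, name, ty, 1)?;
    }
    Ok(out)
}
```

Calling `tree_string` on a struct column `a` with one child `x`, followed by a string column `b`, should give `root`, then ` |-- a: struct (nullable = true)`, ` |    |-- x: long (nullable = true)`, and ` |-- b: string (nullable = true)`.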

codspeed-hq · 2024-12-19T11:11:31Z

CodSpeed Performance Report

Merging #3617 will improve performance by 77.66%

Comparing `andrew/print-schema` (f4a3051) with `main` (e59581c)

Summary

⚡ 2 improvements
✅ 25 untouched benchmarks

Benchmarks breakdown

| | Benchmark | `main` | `andrew/print-schema` | Change |
|---|---|---|---|---|
| ⚡ | `test_iter_rows_first_row[100 Small Files]` | 209.1 ms | 117.7 ms | +77.66% |
| ⚡ | `test_show[100 Small Files]` | 23.8 ms | 15.6 ms | +52.82% |

codecov · 2024-12-19T17:28:31Z

Codecov Report

Attention: Patch coverage is 86.83386% with 42 lines in your changes missing coverage. Please review.

Project coverage is 77.65%. Comparing base (e59581c) to head (f4a3051).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/daft-connect/src/display.rs | 87.63% | 35 Missing ⚠️ |
| src/daft-connect/src/lib.rs | 80.00% | 7 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3617      +/-   ##
==========================================
- Coverage   77.99%   77.65%   -0.34%     
==========================================
  Files         720      721       +1     
  Lines       88794    91520    +2726     
==========================================
+ Hits        69252    71070    +1818     
- Misses      19542    20450     +908

| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/daft-connect/src/translation/schema.rs | `100.00% <100.00%> (ø)` | |
| src/daft-connect/src/lib.rs | `65.01% <80.00%> (+2.13%)` | ⬆️ |
| src/daft-connect/src/display.rs | `87.63% <87.63%> (ø)` | |

... and 14 files with indirect coverage changes

universalmind303 · 2024-12-19T17:37:01Z

src/daft-connect/src/display.rs

+        DataType::FixedShapeImage(_, _, _) => "fixed_shape_image".to_string(),
+        DataType::Tensor(_) => "tensor".to_string(),
+        DataType::FixedShapeTensor(_, _) => "fixed_shape_tensor".to_string(),
+        DataType::SparseTensor(_) => "sparse_tensor".to_string(),
+        DataType::FixedShapeSparseTensor(_, _) => "fixed_shape_sparse_tensor".to_string(),


I don't think these exist in Spark (nor do unsigned ints). We should check whether Spark Connect has a standard for extension or user-defined types. If it doesn't, I'd at least want something in the display to indicate that these are not native Spark types but Daft datatypes.

hmmmmm https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/UserDefinedType.html

can people even define UDTs outside of Java? I don't see a pyspark example

universalmind303 · 2024-12-19T17:41:07Z

src/daft-connect/src/display.rs

+        DataType::Binary => "binary".to_string(),
+        DataType::FixedSizeBinary(_) => "fixed_size_binary".to_string(),
+        DataType::Utf8 => "string".to_string(),
+        DataType::FixedSizeList(_, _) => "array".to_string(), // Spark calls them arrays


I would represent this as a custom type, similar to the other non-native dtypes.

If there is no standard, any thoughts on something like daft[fixed_size_list]?

since there seems to be no standard, I'd prefer to separate them into 2 categories.

arrow native datatypes:

for arrow native datatypes such as unsigned integers, fsl, etc., let's go with arrow.<datatype>, such as

u64 -> arrow.uint64,

fsl(u8, 1) -> arrow.fixed_size_list(1)\n --- element: arrow.u8

(this resembles how arrow does extension types)

custom daft datatypes:

for non arrow native ones such as SparseTensor, Image, and so on, let's prefix them with daft.

image -> daft.image(<image_mode>) ex: daft.image(RGB)

sparsetensor(u8) -> daft.sparse_tensor\n --- element: arrow.u8
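
The two-category naming proposed above could be sketched roughly as follows. This is only an illustration under assumptions: `ToyType` is a hypothetical, trimmed-down enum (not Daft's real `DataType`), and the Spark-native labels `long` and `string` are the names Spark itself prints for those types.

```rust
// Hypothetical, trimmed-down type enum for illustration; NOT Daft's real `DataType`.
#[allow(dead_code)]
enum ToyType {
    Int64,
    Utf8,
    UInt64,
    FixedSizeBinary(usize),
    Image,
    SparseTensor,
}

// Label a type according to the proposed convention:
// - Spark-native types keep Spark's own names,
// - Arrow-native types without a Spark equivalent get an "arrow." prefix,
// - Daft-specific types get a "daft." prefix.
fn display_name(ty: &ToyType) -> &'static str {
    match ty {
        ToyType::Int64 => "long",
        ToyType::Utf8 => "string",
        ToyType::UInt64 => "arrow.uint64",
        ToyType::FixedSizeBinary(_) => "arrow.fixed_size_binary",
        ToyType::Image => "daft.image",
        ToyType::SparseTensor => "daft.sparse_tensor",
    }
}
```

With this split, a reader of the schema can tell from the prefix alone whether a column type is something Spark itself understands.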

should be done

universalmind303 · 2024-12-19T21:23:40Z

src/daft-connect/src/display.rs

+        DataType::FixedSizeBinary(_) => "arrow.fixed_size_binary".to_string(),
+        DataType::Utf8 => "string".to_string(),
+        DataType::FixedSizeList(_, _) => "arrow.fixed_size_list".to_string(),
+        DataType::List(_) => "arrow.list".to_string(),
+        DataType::Struct(_) => "struct".to_string(),
+        DataType::Map { .. } => "map".to_string(),
+        DataType::Extension(_, _, _) => "daft.extension".to_string(),
+        DataType::Embedding(_, _) => "daft.embedding".to_string(),
+        DataType::Image(_) => "daft.image".to_string(),
+        DataType::FixedShapeImage(_, _, _) => "daft.fixed_shape_image".to_string(),
+        DataType::Tensor(_) => "daft.tensor".to_string(),
+        DataType::FixedShapeTensor(_, _) => "daft.fixed_shape_tensor".to_string(),
+        DataType::SparseTensor(_) => "daft.sparse_tensor".to_string(),
+        DataType::FixedShapeSparseTensor(_, _) => "daft.fixed_shape_sparse_tensor".to_string(),
+        #[cfg(feature = "python")]
+        DataType::Python => "daft.python".to_string(),
+        DataType::Unknown => "unknown".to_string(),
+        DataType::UInt8 => "arrow.ubyte".to_string(),
+        DataType::UInt16 => "arrow.ushort".to_string(),
+        DataType::UInt32 => "arrow.uint".to_string(),
+        DataType::UInt64 => "arrow.ulong".to_string(),


Sorry if I was unclear in my previous comment, but this is still not right.

arrow types should just be called what they are

DataType::UInt8 => "arrow.uint8".to_string(),
DataType::UInt16 => "arrow.uint16".to_string(),
DataType::UInt32 => "arrow.uint32".to_string(),
DataType::UInt64 => "arrow.uint64".to_string(),

and nested datatypes should match how spark does them
for example, lists have the inner rendered as "element"

data = [{"a": [1,2,3], "b": "hello"}]
spark.createDataFrame(data).printSchema()

root
 |-- a: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- b: string (nullable = true)

and for structs Struct{ints: i64, strings: utf8}

root
 |-- struct: struct (nullable = true)
 |    |-- ints: integer (nullable = true)
 |    |-- strings: string (nullable = true)

We'll also want to capture the parameters on them such as FixedSizeList(Int64, 1)

root
 |-- a: arrow.fixed_size_list (size = 1, nullable = true)
 |    |-- element: long (containsNull = true)

or on Image(ImageMode::RGB)

root |-- a: daft.image (mode = RGB, nullable = true)
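
A rough sketch of how the requested rendering might look: list-like types print their inner type as an `element` child with `containsNull`, and parameters such as `size` or `mode` appear before the nullability flag. The names `ToyType`, `describe`, and `write_node` are hypothetical and exist only for this illustration; the real Daft types carry more information than modelled here.

```rust
use std::fmt::Write;

// Hypothetical types for illustration; NOT Daft's real `DataType`.
#[allow(dead_code)]
enum ToyType {
    Long,
    Str,
    List(Box<ToyType>),
    FixedSizeList(Box<ToyType>, usize),
    Image(&'static str), // image mode, e.g. "RGB"
}

// Spark-style prefix: level 1 is " |-- "; each deeper level adds "    |".
fn prefix(level: usize) -> String {
    format!(" |{}-- ", "    |".repeat(level.saturating_sub(1)))
}

// Returns the type label plus any parameters shown before the nullability flag.
fn describe(ty: &ToyType) -> (&'static str, String) {
    match ty {
        ToyType::Long => ("long", String::new()),
        ToyType::Str => ("string", String::new()),
        ToyType::List(_) => ("array", String::new()),
        ToyType::FixedSizeList(_, size) => ("arrow.fixed_size_list", format!("size = {}, ", size)),
        ToyType::Image(mode) => ("daft.image", format!("mode = {}, ", mode)),
    }
}

// `null_key` is "nullable" for fields and "containsNull" for list elements.
// Assumption: everything is reported as nullable (`= true`).
fn write_node(
    out: &mut String,
    name: &str,
    ty: &ToyType,
    null_key: &str,
    level: usize,
) -> std::fmt::Result {
    let (label, params) = describe(ty);
    writeln!(out, "{}{}: {} ({}{} = true)", prefix(level), name, label, params, null_key)?;
    match ty {
        // List-like types render their inner type as a child named "element".
        ToyType::List(inner) | ToyType::FixedSizeList(inner, _) => {
            write_node(out, "element", inner, "containsNull", level + 1)
        }
        _ => Ok(()),
    }
}
```

Starting from a `String` containing `"root\n"` and calling `write_node` for a `FixedSizeList(Long, 1)` field named `a` yields the two `a` lines shown in the example above.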

github-actions bot added the feat label Dec 19, 2024

andrewgazelka marked this pull request as ready for review December 19, 2024 16:31

andrewgazelka requested a review from universalmind303 December 19, 2024 16:31

andrewgazelka force-pushed the andrew/print-schema branch from 58c8a1e to 81f2540 Compare December 19, 2024 17:04

universalmind303 requested changes Dec 19, 2024

universalmind303 reviewed Dec 19, 2024

universalmind303 reviewed Dec 19, 2024

andrewgazelka requested a review from universalmind303 December 19, 2024 20:58

universalmind303 reviewed Dec 19, 2024

andrewgazelka added 11 commits December 25, 2024 02:02

feat(connect): printSchema

1e13744

fix several comments

7f20107

str

8373615

make linter happy

7751273

redo tests

d22b416

fix schema tests

0458c03

rename test

b56ee6e

rmeove dev-dependencies

dbe090e

add nested data test

aca8275

skip test

109a903

improve naming

f4a3051

andrewgazelka force-pushed the andrew/print-schema branch from 75d1e43 to f4a3051 Compare December 25, 2024 10:02
