
Error reading partitioned table with slash in partition key #1228

Closed
kvedes opened this issue Mar 17, 2023 · 14 comments · Fixed by #1613
Labels
bug Something isn't working

Comments

@kvedes

kvedes commented Mar 17, 2023

Environment

Delta-rs version: 0.8.1

Binding: python

Environment:

  • Azure and locally

Bug

What happened:
I'm saving a partitioned table from Databricks (DBR 12.1) into Azure Data Lake Storage Gen2. The table is partitioned on a string column whose values contain slashes. When I try to read the table using deltalake from my local PC, I get an error saying there is 'No Body' in the response for the parquet file.

Traceback (most recent call last):
  File "[..]/delta_connect.py", line 32, in <module>
    table = dt.to_pandas(
  File "[..]/python3.10/site-packages/deltalake/table.py", line 418, in to_pandas
    return self.to_pyarrow_table(
  File "[..]/python3.10/site-packages/deltalake/table.py", line 400, in to_pyarrow_table
    return self.to_pyarrow_dataset(
  File "pyarrow/_dataset.pyx", line 369, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2818, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/_fs.pyx", line 1551, in pyarrow._fs._cb_open_input_file
  File "[..]/python3.10/site-packages/deltalake/fs.py", line 22, in open_input_file
    return pa.PythonFile(DeltaFileSystemHandler.open_input_file(self, path))
deltalake.PyDeltaTableError: Object at location partition_table_test/value=you%25252Fthere/part-00007-adfd81b1-1bff-4431-83c2-948e2274a892.c000.snappy.parquet not found: response error "No Body", after 0 retries: HTTP status client error (404 Not Found) for url [..]

If I query for a key without a slash it returns the data without issue.

What you expected to happen:
Successfully read the data.

How to reproduce it:
In Databricks, saving to ADLS Gen2:

df = spark.createDataFrame([
    (0, "hej"),
    (1, "you/there"),
    (2, "the two of us"),
], schema="id int, value string")
df.write.partitionBy("value").saveAsTable("mytable")

Running locally in Python (3.10) gives the above error:

from deltalake import DeltaTable

container_name = "my_container"
storage_account_name = "my_sa"
table_name = "my_table"
sas_token = "my_sas_token"

abfs_path = (
    f"abfs://{container_name}@{storage_account_name}.dfs.core.windows.net/{table_name}/"
)

dt = DeltaTable(
    abfs_path,
    storage_options={
        "sas_key": sas_token,
    },
)

table = dt.to_pandas(
    partitions=[("value", "=", "you/there")],
)

Using the other value "hej" works fine:

table = dt.to_pandas(
    partitions=[("value", "=", "hej")],
)

returns:

    id value
0   0   hej


kvedes added the bug label on Mar 17, 2023
@volker48

I'm hitting the same error with partitions that are time-based, like "2023-03-23 00:00:00". The paths in S3 are URL encoded, so they look like 2023-03-23 00%3A00%3A00 when viewed via aws s3 ls. So I don't think this is unique to having slashes in the partition key.
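
For illustration only (plain Python with urllib, not delta-rs internals), that S3 listing matches what you get when the colon is escaped but the space is left alone; the safe set here is just an assumption to mimic Hive-style partition paths:

from urllib.parse import quote

value = "2023-03-23 00:00:00"
# escape ':' but keep the space, as in the listing above
print(quote(value, safe=" "))  # 2023-03-23 00%3A00%3A00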

@dudleydhsa

dudleydhsa commented Mar 27, 2023

I have the same issue with timestamps as partitions. For example, the timestamp is encoded like this in the list of files in the DeltaTable object:
edl_extract_dt=2022-11-11 13%253A00%253A01/part-00000-2aa47137-6e20-4cd7-a66e-fc5b6762158d.c000.snappy.parquet
This matches the symlink manifest paths.

When the data is actually being read, it seems the path is double encoded: the % characters become encoded again as %25.
The path returned by the 'not found' error:
edl_extract_dt=2022-11-11 13%25253A00%25253A01/part-00000-2aa47137-6e20-4cd7-a66e-fc5b6762158d.c000.snappy.parquet

So it appears to be a double encoding issue.
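
A minimal sketch of that stacking encoding in plain Python, using urllib rather than delta-rs internals; the first, singly encoded form is assumed to be what Spark writes on disk:

from urllib.parse import quote

on_disk = quote("edl_extract_dt=2022-11-11 13:00:01", safe=" =")
# 'edl_extract_dt=2022-11-11 13%3A00%3A01'  -- assumed directory name on disk
listed = quote(on_disk, safe=" =")
# 'edl_extract_dt=2022-11-11 13%253A00%253A01'  -- as shown in the DeltaTable file list / manifest
errored = quote(listed, safe=" =")
# 'edl_extract_dt=2022-11-11 13%25253A00%25253A01'  -- the path in the 'not found' error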

@mrjoe7
Contributor

mrjoe7 commented May 1, 2023

This is caused by the path property being double encoded in ObjectMeta::try_from.

This is how the table data looks on a local drive (note: the file structure was created by pyspark):

> ls
_delta_log  gender=m%2Ff

Contents of JSON transaction log:

{"add":{"path":"gender=m%252Ff/part-00023-c9d275a0-a620-408b-889f-e4d7280dfad3.c000.snappy.parquet","partitionValues":{"gender":"m/f"},"size":511,"modificationTime":1682978128394,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"firstName\":\"Kristine\"},\"maxValues\":{\"firstName\":\"Kristine\"},\"nullCount\":{\"firstName\":0}}"}}

The path gets correctly URL-decoded and stored in the Add action (screenshot: Screenshot_20230502_000705), but it is then URL-encoded again when converted to a Path in ObjectMeta (screenshot: Screenshot_20230502_001313).
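
A rough sketch of that round trip in plain Python (the real conversion happens in Rust's ObjectMeta::try_from; this only mimics the effect with urllib):

from urllib.parse import quote, unquote

# Path exactly as written in the JSON transaction log above
log_path = "gender=m%252Ff/part-00023-c9d275a0-a620-408b-889f-e4d7280dfad3.c000.snappy.parquet"

# Decoding once yields the directory name that actually exists on disk
add_path = unquote(log_path)            # gender=m%2Ff/part-...

# Re-encoding it when building the object-store path breaks the lookup again
meta_path = quote(add_path, safe="/=")  # gender=m%252Ff/part-... (no longer matches)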

wjones127 pushed a commit that referenced this issue May 3, 2023
# Description
This PR fixes #1228.

# Related Issue(s)

-  #1228
@mattfysh

The release notes say a fix has landed in v0.9.0, but I've tested both v0.9.0 and v0.10 and I'm still seeing this issue with S3.

The smallest reproduction I could come up with:

from deltalake.fs import DeltaStorageHandler

# setup storage_options using boto3

store = DeltaStorageHandler(
    "s3://my_bucket/path/to/delta",
    storage_options,
)

# Works
store.open_input_file(
    "p=x/part-00000-3209f4d9-01c8-4e49-a01b-c5f9a6f61262.c000.snappy.parquet"
)

# None of these work
store.open_input_file("p=x%253Ay/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")
store.open_input_file("p=x%3Ay/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")
store.open_input_file("p=x:y/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")

@mattfysh

cc @mrjoe7 @wjones127

@mrjoe7
Contributor

mrjoe7 commented Jun 11, 2023

@mattfysh can you provide a listing of your bucket so we can see how exactly the file you are trying to read is named in S3?

I was able to successfully read the file using Rust in the following example:

Create and upload file to S3:

# create new empty bucket
awslocal s3api create-bucket --bucket delta
# create dummy file
echo 'test' > part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet
# upload it to local S3
awslocal s3 cp part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet s3://delta/path/to/delta/p=x:y/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet

Read the file using Rust and verify the file contents:

use std::collections::HashMap;
use bytes::Bytes;
use object_store::ObjectStore;
use object_store::path::Path;
use url::Url;
use deltalake::storage::DeltaObjectStore;

#[tokio::main(flavor = "current_thread")]
async fn main() -> Result<(), deltalake::errors::DeltaTableError> {
    let store_url = Url::parse("s3://delta/path/to/delta").unwrap();
    dbg!(store_url.path());
    let store = DeltaObjectStore::try_new(
        store_url,
        HashMap::from([
            (
                "AWS_ACCESS_KEY_ID".to_string(),
                "TESTACCESSKEY12345".to_string(),
            ),
            (
                "AWS_SECRET_ACCESS_KEY".to_string(),
                "ABCSECRETKEY".to_string(),
            ),
            ("AWS_REGION".to_string(), "us-east-1".to_string()),
            (
                "AWS_ENDPOINT_URL".to_string(),
                "http://localhost:4566".to_string(),
            ),
            ("AWS_STORAGE_ALLOW_HTTP".to_string(), "TRUE".to_string()),
        ]),
    )
        .unwrap();

    let store_get_res = store.get(&Path::from("p=x:y/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")).await.unwrap();
    let data_res = store_get_res.bytes().await.unwrap();
    assert_eq!(Bytes::from("test\n".to_owned()), data_res);

    Ok(())
}

@mattfysh

mattfysh commented Jun 11, 2023 via email

@mattfysh

mattfysh commented Jun 11, 2023

Ran some tests, and it looks like the partition column value being percent-encoded is what leads to the issue, but even in that case I'm at a loss to understand why test 5 is not working. Note that each file's "part number" changes based on the partition folder name.

In the meantime, I'll try to see why my Databricks job writes partition column values percent-encoded, but I'm inclined to believe it's doing so intentionally.

Setup:

BUCKET_NAME="delta-rs-read-test"
aws s3api create-bucket --bucket $BUCKET_NAME --region us-east-1
echo 'test' > stub.pq
aws s3 cp stub.pq "s3://$BUCKET_NAME/path/to/delta/p=x/part-00000-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet"
aws s3 cp stub.pq "s3://$BUCKET_NAME/path/to/delta/p=x:y/part-00001-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet"
aws s3 cp stub.pq "s3://$BUCKET_NAME/path/to/delta/p=x%3Ay/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet"

Tests:

store = DeltaStorageHandler("s3://delta-rs-read-test/path/to/delta", storage_options)
# Tests 1-2 work
store.open_input_file("p=x/part-00000-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")
store.open_input_file("p=x:y/part-00001-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")
# Tests 3-6 not working
store.open_input_file("p=x%3Ay/part-00001-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")
store.open_input_file("p=x%253Ay/part-00001-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")
# Test 5 uses the same literal path that appears on S3, and should work given that 1 & 2 are both working
store.open_input_file("p=x%3Ay/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")
store.open_input_file("p=x%253Ay/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")

@mattfysh

mattfysh commented Jul 9, 2023

Hey @mrjoe7 - any thoughts on why test 5 doesn't behave the same as tests 1 & 2? I am using the same literal path value present in S3.

@mrjoe7
Contributor

mrjoe7 commented Jul 9, 2023

Hi @mattfysh. It's because of how some characters are encoded by object_store, which delta-rs uses internally.

This is what each of your test paths will get encoded into (not including the /path/to/delta for better readability):

  1. p=x/part-00000-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet
  2. p=x:y/part-00001-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet
  3. p=x%253Ay/part-00001-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet
  4. p=x%25253Ay/part-00001-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet
  5. p=x%253Ay/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet
  6. p=x%25253Ay/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet

As you can see, : is not encoded into %3A, but if you include a % character in the path it will be encoded into %25.
The consequence is that it's impossible to reach a file whose path contains %3A.
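
You can mimic that sanitization with plain Python to see why test 5 can never match (illustrative only; object_store does this in Rust, and the exact safe set here is an assumption):

from urllib.parse import quote

# ':' is an allowed path character, '%' is not, so it gets escaped to '%25'
print(quote("p=x:y/part-00001-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet", safe="/=:"))
# unchanged, which is why test 2 reaches the file under p=x:y/

print(quote("p=x%3Ay/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet", safe="/=:"))
# p=x%253Ay/..., so test 5 can never address the p=x%3Ay/ key that exists on S3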

@mattfysh

Thanks @mrjoe7 - is there any way to change this behaviour? The default behaviour of Databricks (Spark/Hadoop) is to percent-encode these partition column values; I understand they do this to support Windows machines. Here is the Spark job I ran to produce the structure on S3:

# Define the data with specific rows
data = [
    (1, "part1", 100),
    (2, "part2:with_colon", 200),
    (3, "part3", 300),
]

# Define the column names
columns = ["id", "partition_column", "value"]

# Create a DataFrame with specific data
df = spark.createDataFrame(data, columns)

# Write the DataFrame to an S3 bucket in Delta format
s3_bucket_path = "s3a://mysandbox/deltabug"
df.write.format("delta").partitionBy("partition_column").save(s3_bucket_path)

Then when I try to query this table with delta-rs, it fails due to this encoding issue.

@mattfysh

It looks like there may be an option when using object_store to prevent it from re-encoding the percent sign: apache/arrow-rs#3651

I don't know Rust very well, but could this be surfaced as an option in delta-rs and the Python package?

@caseyrathbone

I am able to reproduce this with a simple local example using the Python library. It appears the space is double encoded, translating to %2520 instead of just %20.

import datetime
from deltalake import write_deltalake
import pyarrow as pa

data = pa.table({"data": pa.array(["mydata"]),
                 "inserted_at": pa.array([datetime.datetime.now()]),
                 "partition_column": pa.array(["hello world"])})

write_deltalake(
    table_or_uri="./unqueryable_table",
    mode="append",
    data=data,
    partition_by=["partition_column"],
)

Output

> tree unqueryable_table
unqueryable_table
├── _delta_log
│   └── 00000000000000000000.json
└── partition_column=hello%2520world
    └── 0-8a13288c-f252-43f8-9c2c-c38416e7296c-0.parquet

2 directories, 2 files
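
A read-back along these lines (the same DeltaTable API used earlier in this thread) is presumably what then fails against the doubly encoded directory; the filter value is just the one written above:

from deltalake import DeltaTable

# Read the table written above; the partition lookup resolves to the
# doubly encoded directory name and the parquet file is reported as not found.
dt = DeltaTable("./unqueryable_table")
df = dt.to_pandas(partitions=[("partition_column", "=", "hello world")])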

@ikstewa

ikstewa commented Sep 20, 2023

We're still encountering issues when trying to interface with Spark on Databricks. We see different partition formats from different clients/libraries.

Re-opened in: #1651
