Error reading partitioned table with slash in partition key #1228
Comments
I'm hitting the same error with time-based partitions, like the following:
I have the same issue with timestamps as partition values. The timestamp is percent-encoded in the list of files in the DeltaTable object, and when the data is actually read, the path is encoded again (the % becomes %25). So it appears to be a double-encoding issue.
This is how the table data looks on the local drive (note: the file structure was created by pyspark):
Contents of the JSON transaction log:
{"add":{"path":"gender=m%252Ff/part-00023-c9d275a0-a620-408b-889f-e4d7280dfad3.c000.snappy.parquet","partitionValues":{"gender":"m/f"},"size":511,"modificationTime":1682978128394,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"firstName\":\"Kristine\"},\"maxValues\":{\"firstName\":\"Kristine\"},\"nullCount\":{\"firstName\":0}}"}}
The path gets correctly URL-decoded when the log is read, but is then URL-encoded again when converted back to a storage path.
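For reference, a minimal sketch of how the encoding layers stack up for this value, using Python's urllib.parse as a stand-in (the safe-character sets here are assumptions; the real encoding happens in Rust via object_store):
from urllib.parse import quote, unquote

value = "m/f"                      # raw partition value
dirname = quote(value, safe="")    # "m%2Ff"   - the directory name Spark writes
logged = quote(dirname, safe="")   # "m%252Ff" - the path as it appears in the log
assert unquote(logged) == dirname  # decoding once recovers the real directory name
# Re-encoding the decoded name before the GET request produces "m%252Ff" again,
# which does not exist as an object key, hence the read error.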
The release notes say a fix has landed in v0.9.0, but I've tested both v0.9.0 and v0.10 and I'm still seeing this issue with S3. The smallest reproducible example I could come up with:
from deltalake.fs import DeltaStorageHandler
# setup storage_options using boto3
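# (one possible expansion, assuming default boto3 credentials and a
# hypothetical region; this setup was not part of the original snippet)
import boto3
creds = boto3.Session().get_credentials()
storage_options = {
    "AWS_ACCESS_KEY_ID": creds.access_key,
    "AWS_SECRET_ACCESS_KEY": creds.secret_key,
    "AWS_REGION": "us-east-1",
}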
store = DeltaStorageHandler(
"s3://my_bucket/path/to/delta",
storage_options,
)
# Works
store.open_input_file(
"p=x/part-00000-3209f4d9-01c8-4e49-a01b-c5f9a6f61262.c000.snappy.parquet"
)
# None of these work
store.open_input_file("p=x%253Ay/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")
store.open_input_file("p=x%3Ay/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")
store.open_input_file("p=x:y/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")
@mattfysh can you provide a listing of your bucket so we can see exactly how the file you are trying to read is named in S3? I was able to successfully read the file using Rust in the following example.
Create and upload the file to S3:
# create new empty bucket
awslocal s3api create-bucket --bucket delta
# create dummy file
echo 'test' > part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet
# upload it to local s3
awslocal s3 cp part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet s3://delta/path/to/delta/p=x:y/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet
Read the file using Rust and verify the file contents:
use std::collections::HashMap;
use bytes::Bytes;
use object_store::ObjectStore;
use object_store::path::Path;
use url::Url;
use deltalake::storage::DeltaObjectStore;
#[tokio::main(flavor = "current_thread")]
async fn main() -> Result<(), deltalake::errors::DeltaTableError> {
let store_url = Url::parse("s3://delta/path/to/delta").unwrap();
dbg!(store_url.path());
let store = DeltaObjectStore::try_new(
store_url,
HashMap::from([
(
"AWS_ACCESS_KEY_ID".to_string(),
"TESTACCESSKEY12345".to_string(),
),
(
"AWS_SECRET_ACCESS_KEY".to_string(),
"ABCSECRETKEY".to_string(),
),
("AWS_REGION".to_string(), "us-east-1".to_string()),
(
"AWS_ENDPOINT_URL".to_string(),
"http://localhost:4566".to_string(),
),
("AWS_STORAGE_ALLOW_HTTP".to_string(), "TRUE".to_string()),
]),
)
.unwrap();
let store_get_res = store.get(&Path::from("p=x:y/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")).await.unwrap();
let data_res = store_get_res.bytes().await.unwrap();
assert_eq!(Bytes::from("test\n".to_owned()), data_res);
Ok(())
}
Hey Tomas. I'm not near a computer right now, but the files were written by a Databricks pyspark job, and the folder name contains the percent encoding of the colon character.
In your example, it's not using percent encoding, so that could be why the issue is not showing up? I can double check tomorrow, thanks!
Hey @mrjoe7 - any thoughts on why test 5 doesn't behave the same as tests 1 and 2? I am using the same literal path value that is present in S3.
Hi @mattfysh. It's because of how some characters are encoded by object_store, which delta-rs uses internally. Each of your test paths gets encoded once more before the request is made, so the percent sign in an already-encoded path is itself encoded to %25, and none of the resulting keys matches the object actually stored in S3.
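A minimal sketch of that re-encoding, using Python's urllib.parse.quote as a stand-in for the store's path encoder (the exact safe-character set object_store uses is an assumption here):
from urllib.parse import quote

for p in ["p=x%253Ay", "p=x%3Ay", "p=x:y"]:
    print(p, "->", quote(p, safe="=/"))
# p=x%253Ay -> p=x%25253Ay
# p=x%3Ay   -> p=x%253Ay
# p=x:y     -> p=x%3Ay
In each case the percent sign and the colon are encoded once more, so the request targets a different key than the string that was passed in.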
Thanks @mrjoe7 - I was wondering if there is any way to change this behaviour? The default behaviour of Databricks (spark/hadoop) is to percent-encode these partition column values; I understand they do this to support Windows machines. Here is the spark job I ran to produce the structure on S3:
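A hypothetical minimal PySpark job of this shape, with the bucket path and column names assumed:
# hypothetical reconstruction - not the original job
# `spark` is the ambient Databricks session
df = spark.createDataFrame([("x:y", 1)], ["p", "n"])
df.write.format("delta").partitionBy("p").save("s3://my_bucket/path/to/delta")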
Then when I try to query this table with delta-rs, it fails due to this encoding issue.
It looks like there may be an option when using object_store to prevent it from re-encoding the percent sign: apache/arrow-rs#3651. I don't know Rust very well, but could this be something that can be surfaced as an option in delta-rs and the python package?
I am able to reproduce this with a simple local example using the Python library. It appears the space is double-encoded, translating to %2520 instead of just %20:
import datetime
from deltalake import write_deltalake
import pyarrow as pa
data = pa.table({
    "data": pa.array(["mydata"]),
    "inserted_at": pa.array([datetime.datetime.now()]),
    "partition_column": pa.array(["hello world"]),
})

write_deltalake(
    table_or_uri="./unqueryable_table",
    mode="append",
    data=data,
    partition_by=["partition_column"],
)
Output:
> tree unqueryable_table
unqueryable_table
├── _delta_log
│ └── 00000000000000000000.json
└── partition_column=hello%2520world
└── 0-8a13288c-f252-43f8-9c2c-c38416e7296c-0.parquet
2 directories, 2 files
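Reading the table back then fails (a hedged sketch; the exact error text varies by backend):
from deltalake import DeltaTable

dt = DeltaTable("./unqueryable_table")
# The store requests a re-encoded path instead of the literal
# "partition_column=hello%2520world" directory, so the read errors out.
dt.to_pyarrow_table()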
We're still encountering issues when trying to interface with Spark on Databricks. We see different partition formats from different clients/libraries. Re-opened in: #1651
Environment
Delta-rs version: 0.8.1
Binding: python
Environment:
Bug
What happened:
I'm saving a partitioned table from Databricks (DBR 12.1) into Azure Data Lake Storage V2, partitioned on a string column containing slashes. When trying to read the table using deltalake from my local PC, I get an error saying there is 'No body' in the parquet file.
If I query for a key without a slash it returns the data without issue.
What you expected to happen:
Successfully read the data.
How to reproduce it:
In Databricks, saving to ADLS2 (see the sketch below):
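A hypothetical minimal version of this write; the storage path is assumed, while the column names and the partition values "m/f" and "hej" are taken from elsewhere in this thread:
# hypothetical reconstruction - not the original notebook code
# `spark` is the ambient Databricks session
df = spark.createDataFrame(
    [("Kristine", "m/f"), ("Anna", "hej")],  # "Anna" is a placeholder
    ["firstName", "gender"],
)
df.write.format("delta").partitionBy("gender").save(
    "abfss://container@account.dfs.core.windows.net/table"  # assumed path
)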
Locally running in Python (3.10) gives the above error, while using the other value "hej" works fine and returns the data.
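A hedged sketch of both reads (the path is assumed, and `partitions` is the partition filter accepted by DeltaTable.to_pyarrow_table):
from deltalake import DeltaTable

# storage_options with the Azure credentials are omitted here
dt = DeltaTable("abfss://container@account.dfs.core.windows.net/table")
# Fails with the 'No body' error - the slash in the value round-trips badly:
dt.to_pyarrow_table(partitions=[("gender", "=", "m/f")])
# Works:
dt.to_pyarrow_table(partitions=[("gender", "=", "hej")])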
More details: