Replies: 4 comments 3 replies
-
Interesting use case. Have you tried using the …
-
Thanks @wlandau. I'll switch to …
-
OK, there were some issues, mostly to do with … Not sure if this is better or worse, but good to know. I'll try …
-
At present, we can write `_targets/objects` out to an S3 bucket. Each of the objects will be a file in a `_targets` prefix in the bucket.

Now, AWS S3 supports SQL select operations on objects under particular circumstances. PrestoDB and Trino also support SQL querying of S3 objects. Two important characteristics are:

- each object needs to be addressable on its own (e.g. one object per key or prefix), and
- the objects need to be stored in a supported file format (CSV, JSON, or Parquet).
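For illustration, here is a minimal sketch of what an S3 Select query against one such object could look like, using the `paws` AWS client for R. The bucket name, key, and column names are hypothetical, and it assumes the object was stored as a headered CSV file:

```r
library(paws)

s3 <- paws::s3()

# Hypothetical bucket and key; assumes the target was stored as a CSV file
# under its own prefix inside the bucket.
result <- s3$select_object_content(
  Bucket = "my-targets-bucket",
  Key = "_targets/objects/object1/data",
  Expression = "SELECT s.x, s.y FROM s3object s WHERE s.x > 10",
  ExpressionType = "SQL",
  InputSerialization = list(CSV = list(FileHeaderInfo = "USE")),
  OutputSerialization = list(CSV = list())
)
```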
My thinking is that, whether by design or not, `targets` is quite close to enabling SQL querying of the tabular objects of its entire processing pipeline. If one can write to an S3 bucket as a prefix per object and in a supported file format, one can build downstream tools that make use of SQL queries of those objects for dashboards or derivative models.

This could be really powerful when coupled with Presto and Apache Superset, for instance.
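As a sketch of that coupling (assuming a Presto/Trino coordinator whose Hive connector already has an external table defined over the bucket; the connection details and table name here are hypothetical):

```r
library(DBI)
library(RPresto)

# Hypothetical coordinator and catalog; assumes a Hive-connector table has
# already been declared over the S3 prefix holding the pipeline's objects.
con <- DBI::dbConnect(
  RPresto::Presto(),
  host    = "http://presto.example.com",
  port    = 8080,
  user    = Sys.getenv("USER"),
  catalog = "hive",
  schema  = "targets_store"
)

# Query a hypothetical external table mapped to one object's prefix; a tool
# like Superset would issue similar SQL for dashboards.
dashboard_data <- DBI::dbGetQuery(
  con,
  "SELECT x, avg(y) AS mean_y FROM object1_data GROUP BY x"
)

DBI::dbDisconnect(con)
```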
I have verified that I can write objects to prefixes by simply giving the objects prefix-like names (i.e. instead of calling an object `object1`, I call it `object1/data`). It's awkward, but it works.

I have not been able to hack the file format to write CSV or parquet (my preference is parquet). Before I go down that route, which will probably involve some ugly multi-stage targets per object, I thought I'd ask if this seems like a good idea and, if so, whether we could bake it into the library natively.
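For concreteness, a minimal sketch of the naming hack in a `_targets.R` (the bucket is hypothetical, the slash-in-name trick is exactly the unsupported hack described above, and the AWS resources interface has varied across `targets` versions):

```r
# _targets.R -- sketch of the prefix-naming hack; not a supported pattern.
library(targets)

tar_option_set(
  # Hypothetical bucket name.
  resources = list(bucket = "my-targets-bucket")
)

list(
  # tar_target_raw() takes the target name as a character string, which is
  # how a "/" can be smuggled in so the stored object lands under its own
  # S3 prefix (object1/data instead of object1).
  tar_target_raw(
    "object1/data",
    quote(data.frame(x = 1:10, y = rnorm(10))),
    format = "aws_qs"
  )
)
```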
It basically boils down to supporting `aws_csv`, `aws_parquet`, and/or `aws_json`. I like parquet most of all because it preserves data types across languages and is more performant, but `aws_csv` should be much easier to implement (no additional libs, etc.).
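And a sketch of the multi-stage workaround alluded to above, in case it helps the discussion: a normal target produces the data frame, and a second `format = "file"` target writes it to parquet with `arrow` (paths and names are hypothetical; the step that syncs the file to the bucket is omitted):

```r
library(targets)

list(
  tar_target(
    raw_data,
    data.frame(x = 1:10, y = rnorm(10))
  ),
  tar_target(
    parquet_file,
    {
      # Hypothetical local path mirroring the desired prefix layout.
      path <- "exports/object1/data.parquet"
      dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
      arrow::write_parquet(raw_data, path)
      path # format = "file" targets must return the file path(s)
    },
    format = "file"
  )
)
```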