Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add functionality to connect to local timeseries target data #77

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

annakrystalli
Copy link
Member

@annakrystalli annakrystalli commented Feb 28, 2025

This PR resolves #71

It's the first step in accessing target data and focuses on accessing local timeseries target as a first pass.

It also adds a function for creating an expected schema for the timeseries target data that is applied during read.

Validation is not tackled here nor is reading target data from the cloud.

The tests demonstrate that the function can be used to successfully read time-series files that are:

  • in a single time-series.csv format.
  • in multiple complete files (i.e. all columns contained in each file) within a time-series directory (including with sub-directories)
  • in a hive-partitioned time-series directory

Note that when partitioning on a column that does not match any date task IDs (e.g. date column in timeseries is called date and the data are partitioned on it but the equivalent task ID is target_end_date), the date type is not automatically detectable, even if the files are parquet files. As such, I've included a date_col argument that allows us to specify that column explicitly. Eventually this should be configured via any target configuration file we decide on.

Also removed a couple explicit returns to appease lintr

Copy link

github-actions bot commented Feb 28, 2025

Including removing a couple of unnecessary explicit retuns
@annakrystalli annakrystalli marked this pull request as ready for review February 28, 2025 15:16
@annakrystalli annakrystalli marked this pull request as draft February 28, 2025 15:45
This allows for correct schema determination when timeseries data is partitioned on a date column that does not correspond to valid date task ID column.
@annakrystalli annakrystalli marked this pull request as ready for review February 28, 2025 16:18
Copy link
Member

@zkamvar zkamvar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! I think the date_col argument is a good compromise.

The only thing I would suggest is to use {gert} instead of {git2r}. There are a few advantages:

  1. 10x smaller than {git2r}
  2. {usethis} uses it on the backend, so devs will likely already have it installed
  3. it's more "batteries included" than {git2r} (https://docs.ropensci.org/gert/#differences-with-git2r)


structure(ts_data,
class = c("target_timeseries", class(ts_data)),
ts_path = as.character(fs::path_rel(ts_path, hub_path))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So glad you did this---adding the as.character() is always something that I forget to do with fs paths and it always bites me in the end.


expect_error(
create_timeseries_schema(tmp_hub_path),
"No .*date.* type column found in .*time-series.*."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh this is a clever solution to getting around formatting issues!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create function to open local timeseries target data
2 participants