Partial branch invalidity/downloading a time series #591

tel · 2021-08-10T15:01:49Z

Hi targets community,

I'm interested in modeling a financial time series problem using targets. This poses a somewhat unique challenge: a substantial number of targets are downstream of a large data set which is updated incrementally, daily.

Most of the computation downstream from this target can be done in parallel across the entire data set. Thus, most of the computation is not invalidated by one new day of data.

If this download task is triggered monolithically, then all downstream tasks get invalidated daily and a lot of recomputation is necessary. A simple fix is to split the download task into dynamic batches (say cross(product, month)) and then front-load all of the parallelizable targets to exploit this. So far I've implemented this method successfully.

Unfortunately, this forces one download operation per cross(product, month). This is highly inefficient! On a fresh load, it's possible to obtain the entire data set with a single API call. On an incremental load, one month will be invalidated for each product, but again a single API call could be used to update all of the data.

What I'd like to do is to tap into one of two mechanisms to make this whole system efficient:

1. If I could access the previously cached value of this target, then I could treat it monolithically, examine the target for staleness, and execute a minimal "update" function. Then I could use group_by batching to ensure downstream tasks respect that most chunks of the data have remained valid.
2. Alternatively, or similarly, if I could find out which branches of a given dynamic branching target have been invalidated, then I could combine those into a single efficient query.

In either case, the need is to be able to access some historical or "meta" information in the target's command. Is there an existing mechanism to do something of this nature? Alternatively, do others with experience working on regularly-updating time series using targets have some best practices for this situation?
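To make the first idea concrete, here is a minimal base R sketch of the "minimal update" step, kept independent of targets. Everything in it is hypothetical: `api_data` and `fetch_since()` stand in for the real API, `update_dataset()` is an illustrative helper, and the sketch assumes every product shares the same most recent date. How the cached value would be obtained inside a target's command is exactly the open question above, so it is not shown.

```r
# Hypothetical stand-in for the remote data source.
api_data <- data.frame(
  product = rep(c("A", "B"), each = 3),
  date = rep(as.Date("2021-08-08") + 0:2, times = 2),
  price = c(10, 11, 12, 20, 21, 22)
)

# Hypothetical API wrapper: one call returns all rows newer than `since`,
# or the whole data set when `since` is NULL.
fetch_since <- function(since = NULL) {
  if (is.null(since)) {
    return(api_data)
  }
  api_data[api_data$date > since, , drop = FALSE]
}

# Append only the rows that are newer than anything already cached.
# Assumption: all products are current up to the same date.
update_dataset <- function(cached, fetch) {
  if (is.null(cached) || nrow(cached) == 0) {
    return(fetch(NULL)) # fresh load: a single call for everything
  }
  last_seen <- max(cached$date)
  fresh <- fetch(last_seen) # incremental load: a single call for new rows
  rbind(cached, fresh)
}

# Yesterday's cache holds the first two days; today adds one day per product.
cached <- api_data[api_data$date <= as.Date("2021-08-09"), ]
updated <- update_dataset(cached, fetch_since)
```

Both the fresh load and the incremental load use one call to `fetch`, which is the efficiency I'm after.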

Thanks,
Joseph

Answered by wlandau

wlandau · 2021-08-11T15:23:50Z

Maintainer

> If this download task is triggered monolithically then all downstream tasks will get invalidated daily...

It's actually possible to avoid most of that invalidation without getting too low-level. Sketch:

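# _targets.R file
library(targets)
library(tarchetypes)
list(
  # Download everything in one call, then split the result into row groups
  # by product and month so each group becomes one dynamic branch downstream.
  tar_group_by(
    name = large_dataset,
    command = download_full_dataset(),
    product_column,
    month_column,
    # Rerun the download once the stored value is more than a day old.
    cue = tar_cue_age(large_dataset, as.difftime(1, units = "days"))
  ),
  # One branch of analysis per row group of large_dataset.
  tar_target(
    name = analysis,
    command = analyze_subset(large_dataset),
    pattern = map(large_dataset)
  )
)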

The full dataset will download every day, but a branch of analysis will not rerun if its corresponding row group of large_dataset did not change. targets detects changes with hashes, not timestamps, so it is possible to rerun a target without invalidating its downstream neighbors.
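The idea of comparing row groups by hash can be illustrated outside targets with a rough base R sketch. This is not how targets computes its hashes internally; `group_hashes()` and `changed_groups()` are hypothetical helpers, the column names `product` and `month` are assumptions, and each group is hashed here by writing it to a temporary CSV and calling `tools::md5sum()`.

```r
# Hash each (product, month) row group separately.
group_hashes <- function(data) {
  groups <- split(data, list(data$product, data$month), drop = TRUE)
  vapply(groups, function(group) {
    path <- tempfile(fileext = ".csv")
    on.exit(unlink(path))
    write.csv(group, path, row.names = FALSE)
    unname(tools::md5sum(path))
  }, character(1))
}

# Names of groups that are new or whose hash differs from the previous run.
changed_groups <- function(before, after) {
  previous <- before[names(after)]
  names(after)[is.na(previous) | previous != after]
}

# Yesterday's full download.
old <- data.frame(
  product = c("A", "A", "B", "B"),
  month = c("2021-07", "2021-08", "2021-07", "2021-08"),
  date = as.Date(c("2021-07-30", "2021-08-09", "2021-07-30", "2021-08-09")),
  price = c(10, 11, 20, 21)
)

# Today's full download: the same rows plus one new day per product.
new <- rbind(
  old,
  data.frame(
    product = c("A", "B"),
    month = "2021-08",
    date = as.Date("2021-08-10"),
    price = c(12, 22)
  )
)

before <- group_hashes(old)
after <- group_hashes(new)
changed <- changed_groups(before, after)
print(changed)
```

Only the August groups have new hashes, so only the branches that depend on them would need to rerun, even though the whole data set was downloaded again.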

tel Aug 11, 2021
Author

That's a good point, and probably the right way to handle this in the near term, thanks. I'm still interested in the mechanisms from my original post, since I suspect they would be of recurring interest in time series applications, where data sets often change only slightly between runs. For now, though, I can run with this.
