Arbitrary Objects in Parameters #293

Open · jrdnmdhl opened this issue May 20, 2017 · 4 comments

Comments

@jrdnmdhl (Contributor)

Apologies, this is another one of my essays in which I suggest a potential course of action, request feedback, and ask whether I should develop this into a pull request.

Motivation

One of the current limitations of the parameters functionality is that it only works for atomic vectors. Relaxing this restriction would open up a great deal of additional functionality. I see a few advantages:

  • Solves #261 (Allow hr to be a parameter in apply_hr) by letting survival distributions be defined within the parameters object.
  • Allows for models to be more self-contained (extremely important when models are defined using tabular input style)
  • Effectively allows for bootstrapping of datasets (if a dataset is a parameter, it can be resampled via bootstrapping, propagating uncertainty to anything that depends on it).

My understanding of what is available when evaluating parameters

As I understand it, parameters essentially have access to two namespaces when they are evaluated:

  1. The environment from which define_parameters was called. In the case of tabular inputs, this environment also holds any additional datasets supplied by the user. This environment can in principle hold any type of object.
  2. The evaluated parameters tibble (which at this point includes any previously evaluated parameters). Access to this namespace is provided automatically by dplyr, and it is why parameters can reference previously defined parameters. Since this namespace is a tibble, it only works well with atomic vectors (see the sketch below).
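
A minimal sketch of this two-namespace lookup, assuming lazyeval (which the evaluation code below relies on) and tibble are available; all object names here are illustrative:

extra_data <- data.frame(a = 1:3)            # lives in the calling environment
tib <- tibble::tibble(x = 1:5)               # stands in for the evaluated parameters tibble

lz <- lazyeval::lazy(x + nrow(extra_data))   # captures the expression and its environment

# `x` is resolved from the data (the tibble); `extra_data` falls back to the environment
lazyeval::lazy_eval(lz, data = tib)
#> [1] 4 5 6 7 8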

How arbitrary objects could be supported

I think the key to implementing this is identifying which objects belong in the environment and which in the tibble. In particular, I think time-dependent* atomic vectors belong in the tibble, whereas all other values belong in the environment. The function that evaluates parameters can be modified to evaluate each parameter on its own, determine what type of object it is, and then assign it to the environment or the tibble accordingly.

  • While it may seem like non-time-dependent atomic vectors should also be stored in the tibble, this actually creates more problems than it solves. As an example, the current partitioned survival model code needs a hack in which it evaluates the survival distributions only on the first row of the parameters tibble, because otherwise it receives parameter values recycled to the number of cycles. By storing single-valued atomic vectors in the environment, we make sure they don't get recycled unnecessarily (see the sketch below).
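
A hypothetical illustration of the recycling issue (all names made up):

res <- tibble::tibble(markov_cycle = 1:5)
res$hr <- 0.5   # the single value is silently recycled to one copy per cycle
res$hr
#> [1] 0.5 0.5 0.5 0.5 0.5
# An expression expecting one value now sees five copies; keeping such
# values in the environment avoids the recycling.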

A partially working implementation and demonstration

In the example below, I define a data.frame as a parameter, pass it and another parameter to flexsurvreg to estimate a parametric survival model, define a second survival distribution using apply_hr (with yet another parameter as the hazard ratio), and then calculate survival probabilities for the two distributions.

my_eval_params <- function(x, cycles = 1, strategy_name = NA) {

  # Get the number of parameters and their names
  n_par <- length(x)
  par_names <- names(x)

  # Extract the environment in which define_parameters() was called
  par_env <- x[[1]]$env

  # Set up the long-format tibble representing model_time
  res <- tibble::tibble(
    model_time = seq_len(cycles),
    markov_cycle = seq_len(cycles),
    strategy = strategy_name
  )
  n_row <- nrow(res)

  # Atomic vector classes that can live in a tibble column
  simple_types <- c("numeric", "integer", "character", "logical",
                    "factor", "complex", "raw")

  # Evaluate each parameter and route it to the tibble or the environment
  for (i in seq_len(n_par)) {
    # Evaluate the lazy expression against the tibble; anything not found
    # there is looked up in par_env
    parval <- lazyeval::lazy_eval(x[[i]], data = res)

    # Instead of checking for time-dependency here, I'll cheat by just
    # checking the vector length
    if (inherits(parval, simple_types) && length(parval) == n_row) {
      # Time-varying atomic vector: add it as a column of the tibble
      res[[par_names[i]]] <- parval
    } else {
      # Everything else (non-atomic, or not one value per cycle):
      # assign it into the environment
      par_env[[par_names[i]]] <- parval
    }
  }

  res
}

my_pars <- define_parameters(
  dist = "weibull",
  hr = 0.5,
  td = model_time + 1,
  surv_dat = bc,
  surv_dist = flexsurvreg(Surv(recyrs, censrec) ~ 1, data = surv_dat, dist = dist),
  surv_dist_hr = apply_hr(surv_dist, hr),
  surv_prob = compute_surv(surv_dist, markov_cycle, type = "surv"),
  surv_prob_hr = compute_surv(surv_dist_hr, markov_cycle, type = "surv")
)

my_eval_params(my_pars, cycles = 5)

Result:

# A tibble: 5 × 6
  model_time markov_cycle strategy td surv_prob surv_prob_hr
1          1            1       NA  2 0.9062387    0.9519657
2          2            2       NA  3 0.7884550    0.8879499
3          3            3       NA  4 0.6716540    0.8195450
4          4            4       NA  5 0.5633823    0.7505880
5          5            5       NA  6 0.4667106    0.6831622

Note that the parameters written to the environment don't show up when the evaluated parameters are printed, but those values are still "there" and accessible whenever any of the lazy expressions is evaluated.
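
For example, assuming you keep a handle on the environment (par_env inside my_eval_params above), the routed objects can still be inspected directly:

ls(par_env)               # includes "dist", "hr", "surv_dat", "surv_dist", "surv_dist_hr"
head(par_env$surv_dat)    # the data.frame parameter is intact
par_env$surv_dist         # the fitted flexsurvreg object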

Where it gets complicated

While this implementation is fairly trivial in a simple case, it does get a bit more complex with state expansion and in DSAs and PSAs. I'm not entirely sure how to extend this kind of implementation to those cases just yet.

@MattWiener (Contributor)

Jordan - I think the ability to have arbitrary objects as parameters would be useful, particularly for the dataset case you mention. Currently, survival fits and partitioned survival objects are stored in tibbles. Does that solve any part of the problem?
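
For reference, a tibble can hold arbitrary objects through a list-column, which is presumably the mechanism meant here; the lm() fits below are stand-ins for the survival objects:

fits <- tibble::tibble(
  dist = c("weibull", "exp"),
  fit  = list(lm(mpg ~ wt, mtcars), lm(mpg ~ hp, mtcars))  # stand-in fitted objects
)
fits$fit[[1]]   # retrieve an individual fit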

@jrdnmdhl (Contributor, Author)

I don't think so. I think your work solves separate problems.

@pierucci (Owner)

Thanks a lot @jrdnmdhl. I see another benefit of keeping track of objects that are not vector parameters: it would allow us to fix #211 by providing an easy way to import those objects into the cluster nodes.

Furthermore, allowing for re-sampling of fitted models has always been on my mind, so I'm interested in anything that goes in that direction.

I need some time to look at your proposition and think about ways to extend it to evaluation on new parameter values (the basis of PSA, DSA, and heterogeneity analysis). Your propositions are very welcome!

@jrdnmdhl (Contributor, Author) commented May 27, 2017

I've completed a proof-of-concept implementation, which can be viewed here:
https://github.com/jrdnmdhl/heemod/tree/parameters_overhaul

@pierucci @MattWiener @zkabat Please let me know if you think this is moving in the right direction.

Summary of changes so far:

  • Parameters are represented as a list instead of a tibble. This means that any data type is supported, and vectors will be of the appropriate length if they are time-dependent.
  • Instead of expanding the unevaluated state values and transition matrices, all parameters, state values, and transition probabilities are evaluated for each relevant combination of state_time and model_time using vectorization, then reshaped into the proper format using reshape2::acast (see the sketch below). This potentially avoids some unnecessary looping and repeated evaluation of the same formulas.
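
A minimal sketch of the vectorized-evaluate-then-reshape idea, assuming reshape2 is available; the parameter expression is illustrative only:

library(reshape2)

# Evaluate one expression over every (model_time, state_time) combination at once
grid <- expand.grid(model_time = 1:3, state_time = 1:2)
grid$p <- 0.1 * grid$model_time + 0.01 * grid$state_time

# Reshape the long result into a model_time x state_time array
acast(grid, model_time ~ state_time, value.var = "p")
#>      1    2
#> 1 0.11 0.12
#> 2 0.21 0.22
#> 3 0.31 0.32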

What remains to be done:

  • Clean up code
  • Ensure that the result has everything it needs, in the proper format
  • Fix sensitivity analyses and partitioned survival models
  • Transition costs/effects
  • Update documentation + vignettes

Other problems:

  • The fact that parameter evaluation is done once per strategy is problematic. This means that all parameters get evaluated twice, which is not ideal when estimation is included in the parameters.
